Skip to content

Boykadakim/User-Clustering-with-BERT-Models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

User Clustering with BERT Models 🌟

Welcome to the User Clustering Pipelines with BERT Models on Long and Heterogeneous Tweets repository. This project is part of a Bachelor of Science thesis that explores user clustering techniques using advanced BERT models. We aim to provide insights into clustering methodologies that leverage the power of natural language processing.

Releases

Table of Contents

  1. Introduction
  2. Installation
  3. Usage
  4. Clustering Techniques
  5. BERT Models
  6. Dataset
  7. Results
  8. Contributing
  9. License
  10. Contact

Introduction

In the age of social media, understanding user behavior is crucial. This project utilizes BERT models to cluster users based on their tweet content. By analyzing long and heterogeneous tweets, we can identify patterns and group users with similar interests. This repository contains the code, data, and documentation necessary to replicate our findings.

Installation

To get started, clone the repository:

git clone https://github.com/Boykadakim/User-Clustering-with-BERT-Models.git
cd User-Clustering-with-BERT-Models

Next, install the required dependencies. You can do this using pip:

pip install -r requirements.txt

Make sure you have Python 3.6 or higher installed. You may also need to install additional libraries based on your environment.

Usage

To run the user clustering pipeline, execute the following command:

python main.py

This command will initiate the clustering process. You can adjust parameters in the configuration file to fine-tune the clustering algorithms used.

For more detailed instructions, check the Releases section for downloadable files that contain example scripts and data.

Clustering Techniques

This project implements several clustering techniques:

  • K-Means Clustering: A popular method for partitioning data into K distinct groups.
  • DBSCAN: A density-based clustering algorithm that can find clusters of varying shapes and sizes.
  • HDBSCAN: An extension of DBSCAN that handles varying densities.
  • Agglomerative Clustering: A hierarchical clustering method that builds a tree of clusters.

Each method has its strengths and is suited for different types of data distributions.

Clustering Workflow

  1. Data Preprocessing: Clean and prepare the tweet data for analysis.
  2. Feature Extraction: Use BERT embeddings to convert tweets into numerical vectors.
  3. Clustering: Apply one or more clustering algorithms to group users based on their tweet content.
  4. Evaluation: Assess the quality of the clusters using metrics like silhouette score and Davies-Bouldin index.

BERT Models

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language representation model. It captures the context of words in a sentence, making it ideal for understanding the nuances of language in tweets.

Using BERT for Clustering

  1. Pre-trained Models: We use pre-trained BERT models available from the Hugging Face Transformers library.
  2. Fine-tuning: Depending on your specific dataset, you may want to fine-tune the BERT model for better performance.
  3. Embedding Extraction: Convert tweets into embeddings that can be used for clustering.

Dataset

The dataset consists of tweets collected from Twitter. It includes a diverse range of topics and user interactions. The data is cleaned and preprocessed to remove noise and irrelevant information.

Data Sources

  • Twitter API: Tweets are collected using the Twitter API.
  • Public Datasets: Additional datasets may be used for validation and testing.

Results

After executing the clustering algorithms, you will obtain clusters of users based on their tweet content. Visualizations can help interpret the results. We recommend using UMAP for dimensionality reduction to visualize high-dimensional embeddings.

Example Results

  • Cluster 1: Users interested in technology and programming.
  • Cluster 2: Users focused on sports and fitness.
  • Cluster 3: Users discussing politics and current events.

Contributing

We welcome contributions to improve this project. Please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Commit your changes and push to your branch.
  4. Open a pull request for review.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or suggestions, please contact:

Thank you for your interest in the User Clustering with BERT Models project! We hope you find it useful and informative. For more information, please visit the Releases section.

About

User Clustering Pipelines with BERT Models on Long and Heterogeneous Tweets - BSc Thesis

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •