Welcome to the User Clustering Pipelines with BERT Models on Long and Heterogeneous Tweets repository. This project is part of a Bachelor of Science thesis on clustering Twitter users with BERT models. The goal is to show how different clustering methods behave when applied to BERT-based representations of tweet text.
- Introduction
- Installation
- Usage
- Clustering Techniques
- BERT Models
- Dataset
- Results
- Contributing
- License
- Contact
This project uses BERT models to cluster Twitter users by the content of their tweets. By analyzing long and heterogeneous tweets, it identifies patterns and groups users with similar interests. This repository contains the code, data, and documentation needed to replicate the findings.
To get started, clone the repository:
git clone https://github.com/Boykadakim/User-Clustering-with-BERT-Models.git
cd User-Clustering-with-BERT-Models
Next, install the required dependencies. You can do this using pip:
pip install -r requirements.txt
Make sure you have Python 3.6 or higher installed. You may also need to install additional libraries based on your environment.
To run the user clustering pipeline, execute the following command:
python main.py
This command will initiate the clustering process. You can adjust parameters in the configuration file to fine-tune the clustering algorithms used.
For more detailed instructions, check the Releases section for downloadable files that contain example scripts and data.
This project implements several clustering techniques:
- K-Means Clustering: A popular method for partitioning data into K distinct groups.
- DBSCAN: A density-based clustering algorithm that can find clusters of varying shapes and sizes.
- HDBSCAN: An extension of DBSCAN that handles varying densities.
- Agglomerative Clustering: A hierarchical clustering method that builds a tree of clusters.
Each method has its strengths and is suited for different types of data distributions.
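To make the comparison concrete, here is a minimal sketch that runs three of the listed methods through scikit-learn on synthetic data. The synthetic blobs stand in for user embeddings, and every parameter value (`n_clusters=3`, `eps=6.0`, `min_samples=5`) is an illustrative assumption, not a setting taken from this repository. HDBSCAN is left out because its availability depends on the environment: it ships as `sklearn.cluster.HDBSCAN` in scikit-learn 1.3 and later, or in the separate `hdbscan` package.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for user embeddings: 300 points, 16 dimensions, 3 groups.
X, _ = make_blobs(n_samples=300, n_features=16, centers=3,
                  cluster_std=1.0, random_state=0)

# Hypothetical parameter choices, for illustration only.
models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=6.0, min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Every estimator exposes fit_predict, which returns one label per row of X.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, y in labels.items():
    # DBSCAN marks noise points with the label -1; the other two never do.
    n_clusters = len(set(y.tolist())) - (1 if -1 in y else 0)
    n_noise = int(np.sum(y == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

K-Means and Agglomerative Clustering need the number of clusters up front, while DBSCAN infers it from `eps` and `min_samples` and may label some points as noise.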
The pipeline consists of the following steps:
- Data Preprocessing: Clean and prepare the tweet data for analysis.
- Feature Extraction: Use BERT embeddings to convert tweets into numerical vectors.
- Clustering: Apply one or more clustering algorithms to group users based on their tweet content.
- Evaluation: Assess the quality of the clusters using metrics like silhouette score and Davies-Bouldin index.
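The evaluation step can be sketched with scikit-learn's `silhouette_score` and `davies_bouldin_score`. The example below scores K-Means for several values of `k` on synthetic data. The data, the range of `k`, and the choice of K-Means are assumptions made for illustration, not the thesis setup.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for user embeddings.
X, _ = make_blobs(n_samples=200, n_features=8, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means better-separated clusters.
    sil = silhouette_score(X, cluster_labels)
    # Davies-Bouldin is non-negative; lower means better-separated clusters.
    dbi = davies_bouldin_score(X, cluster_labels)
    scores[k] = (sil, dbi)

for k, (sil, dbi) in scores.items():
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

The two metrics point in opposite directions: a good clustering has a high silhouette score and a low Davies-Bouldin index. Both require at least two clusters, so noise-only or single-cluster results from density-based methods need to be filtered out before scoring.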
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language representation model. It encodes each word using the context on both sides of it, which helps with the informal and ambiguous language found in tweets.
- Pre-trained Models: We use pre-trained BERT models available from the Hugging Face Transformers library.
- Fine-tuning: Depending on your specific dataset, you may want to fine-tune the BERT model for better performance.
- Embedding Extraction: Convert tweets into embeddings that can be used for clustering.
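Embedding extraction could look roughly like the sketch below, which uses the Hugging Face `AutoTokenizer` and `AutoModel` classes. The model name `bert-base-uncased`, the `max_length` of 128, mean pooling over tokens, and averaging tweet vectors per user are all assumptions chosen for illustration. The repository may use a different model or pooling strategy. The helper names `mean_pool`, `embed_tweets`, and `user_embeddings` are hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    tokens = np.asarray(token_embeddings, dtype=np.float32)         # (batch, seq, hidden)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]  # (batch, seq, 1)
    summed = (tokens * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def embed_tweets(texts, model_name="bert-base-uncased", max_length=128):
    """Return one vector per tweet. Downloads the model on first use."""
    # Imported lazily so the pooling helpers work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state.numpy(),
                     batch["attention_mask"].numpy())


def user_embeddings(tweet_vectors, user_ids):
    """Average the tweet vectors of each user into one vector per user."""
    vectors = np.asarray(tweet_vectors)
    ids = np.asarray(user_ids)
    users = np.unique(ids)
    return users, np.stack([vectors[ids == u].mean(axis=0) for u in users])
```

A call such as `user_embeddings(embed_tweets(texts), user_ids)` would then yield one row per user, ready to pass to a clustering algorithm. Tweets longer than `max_length` tokens are truncated in this sketch, which may matter for long tweets.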
The dataset consists of tweets collected from Twitter. It includes a diverse range of topics and user interactions. The data is cleaned and preprocessed to remove noise and irrelevant information.
- Twitter API: Tweets are collected using the Twitter API.
- Public Datasets: Additional datasets may be used for validation and testing.
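As an illustration of the cleaning mentioned above, the following sketch strips URLs and user mentions and normalizes whitespace. What counts as noise here is an assumption, and the function name `clean_tweet` is hypothetical. The actual preprocessing in this repository may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove URLs and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Drop only the '#' symbol so the topic word itself is kept.
    text = text.replace("#", "")
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("Check this out https://t.co/abc @user  #NLP rocks"))
# prints: Check this out NLP rocks
```

Hashtag words are kept in this sketch because they often carry topical signal. Whether to keep them is a design choice that depends on the dataset.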
After executing the clustering algorithms, you will obtain clusters of users based on their tweet content. Visualizations can help interpret the results. We recommend using UMAP for dimensionality reduction to visualize high-dimensional embeddings.
- Cluster 1: Users interested in technology and programming.
- Cluster 2: Users focused on sports and fitness.
- Cluster 3: Users discussing politics and current events.
We welcome contributions to improve this project. Please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Commit your changes and push to your branch.
- Open a pull request for review.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or suggestions, please contact:
- Author: Your Name
- GitHub: Boykadakim
Thank you for your interest in the User Clustering with BERT Models project! We hope you find it useful and informative. For more information, please visit the Releases section.