User Clustering with BERT Models 🌟

Welcome to the User Clustering Pipelines with BERT Models on Long and Heterogeneous Tweets repository. This project is part of a Bachelor of Science thesis that explores user clustering techniques using advanced BERT models. We aim to provide insights into clustering methodologies that leverage the power of natural language processing.

Introduction

In the age of social media, understanding user behavior is crucial. This project utilizes BERT models to cluster users based on their tweet content. By analyzing long and heterogeneous tweets, we can identify patterns and group users with similar interests. This repository contains the code, data, and documentation necessary to replicate our findings.

Installation

To get started, clone the repository:

git clone https://github.com/Boykadakim/User-Clustering-with-BERT-Models.git
cd User-Clustering-with-BERT-Models

Next, install the required dependencies. You can do this using pip:

pip install -r requirements.txt

Make sure you have Python 3.6 or higher installed. You may also need to install additional libraries based on your environment.

Usage

To run the user clustering pipeline, execute the following command:

python main.py

This command will initiate the clustering process. You can adjust parameters in the configuration file to fine-tune the clustering algorithms used.

For more detailed instructions, check the Releases section for downloadable files that contain example scripts and data.

Clustering Techniques

This project implements several clustering techniques:

K-Means Clustering: A popular method for partitioning data into K distinct groups.
DBSCAN: A density-based clustering algorithm that can find clusters of varying shapes and sizes.
HDBSCAN: An extension of DBSCAN that handles varying densities.
Agglomerative Clustering: A hierarchical clustering method that builds a tree of clusters.

Each method has its strengths and is suited for different types of data distributions.

Clustering Workflow

Data Preprocessing: Clean and prepare the tweet data for analysis.
Feature Extraction: Use BERT embeddings to convert tweets into numerical vectors.
Clustering: Apply one or more clustering algorithms to group users based on their tweet content.
Evaluation: Assess the quality of the clusters using metrics like silhouette score and Davies-Bouldin index.

BERT Models

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language representation model. It captures the context of words in a sentence, making it ideal for understanding the nuances of language in tweets.

Using BERT for Clustering

Pre-trained Models: We use pre-trained BERT models available from the Hugging Face Transformers library.
Fine-tuning: Depending on your specific dataset, you may want to fine-tune the BERT model for better performance.
Embedding Extraction: Convert tweets into embeddings that can be used for clustering.

Dataset

The dataset consists of tweets collected from Twitter. It includes a diverse range of topics and user interactions. The data is cleaned and preprocessed to remove noise and irrelevant information.

Data Sources

Twitter API: Tweets are collected using the Twitter API.
Public Datasets: Additional datasets may be used for validation and testing.

Results

After executing the clustering algorithms, you will obtain clusters of users based on their tweet content. Visualizations can help interpret the results. We recommend using UMAP for dimensionality reduction to visualize high-dimensional embeddings.

Example Results

Cluster 1: Users interested in technology and programming.
Cluster 2: Users focused on sports and fitness.
Cluster 3: Users discussing politics and current events.

Contributing

We welcome contributions to improve this project. Please follow these steps:

Fork the repository.
Create a new branch for your feature or bug fix.
Commit your changes and push to your branch.
Open a pull request for review.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or suggestions, please contact:

Author: Your Name
GitHub: Boykadakim

Thank you for your interest in the User Clustering with BERT Models project! We hope you find it useful and informative. For more information, please visit the Releases section.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
CSF.ipynb		CSF.ipynb
Clustering.ipynb		Clustering.ipynb
DBCV [30].py		DBCV [30].py
LICENSE		LICENSE
README.md		README.md
Random_selection.ipynb		Random_selection.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

User Clustering with BERT Models 🌟

Table of Contents

Introduction

Installation

Usage

Clustering Techniques

Clustering Workflow

BERT Models

Using BERT for Clustering

Dataset

Data Sources

Results

Example Results

Contributing

License

Contact

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

Boykadakim/User-Clustering-with-BERT-Models

Folders and files

Latest commit

History

Repository files navigation

User Clustering with BERT Models 🌟

Table of Contents

Introduction

Installation

Usage

Clustering Techniques

Clustering Workflow

BERT Models

Using BERT for Clustering

Dataset

Data Sources

Results

Example Results

Contributing

License

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages