Skip to content

The project predicts the sentiment ( positive ,negative, neutral ) on real-time tweets using Tweepy, AWS Kinesis Streaming, S3, Tensorflow Serving, BERT

Notifications You must be signed in to change notification settings

agvar/Deep_Learning_Text

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python Jupyter Notebook scikit-learn AWS TensorFlow Keras

Twitter Sentiment Prediction (Deep Learning)

Description

The project is used to predict the sentiment ( positive, negative, neutral ) on real-time tweets, read using the tweepy API with a python Producer process. The tweets are pushed into a Kinesis data stream from where a python Consumer process reads them and invokes the Tensorflow Serving API to predict the sentiment. The TensorFlow Serving application uses a BERT model and a classifier layer on top to make predictions. The response predictions are displayed on a Streamlit dashboard to depict trends. The Twitter topic to be used when pulling tweets can be configured, along with the number of tweets to be pulled, at a time. The raw tweets and the predictions are stored on AWS S3 as JSON files.

Table of Contents

Process Flow

Architecture Diagram

Data Collection, Preparation

Datasets used:

  1. Sentiment140 Dataset Details
    Source : http://help.sentiment140.com/for-students
    Description: The training data was automatically created, as opposed to having humans manual annotate tweets. In the approach used, any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search. This is described in the following paper(https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf) The data is a CSV with emoticons removed.
  2. Twitter Airline Sentiment Dataset
    Source -https://www.kaggle.com/crowdflower/twitter-airline-sentiment
    Description :This data originally came from Crowdflower's Data for Everyone library. As the original source says,a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

Model selection, training

The final model used , is trained in the following notebook which uses BERT+ classification layer to predict Positive,Negetive and Neutral sentiment, trained on the airline tweet sentiment dataset. This used the BERT model from the tensorflow hub:
Notebook for Analysis using BERT+classifier for Postive ,Negetive and Neutral sentiment

Other models that were considered are as follows:

  1. The previous model used , is trained in the following notebook which uses BERT+ classification layer to predict Positive,and Negetive sentiment trained on the twitter 140 sentiment data:
    Notebook for initial Analysis

  2. Analysis , model selection using tensorflow (multiple ways of reading input)
    Notebook for Analysis using BERT+classifier for Postive and Negetive sentiment

  3. Analysis using Hugging Face transformers model with tensorflow:
    Notebook for Hugging Face for Postive and Negetive sentiment

  4. Analysis using Hugging Face transformers model with tensorflow on Google Colab:
    Colab Notebook for Hugging Face for Postive and Negetive sentiment

  5. Analysis using Doc2Vec with gensim
    Notebook for Analysis using doc2vec

Installation

Clone the repository
git clone git@github.com:agvar/Deep_Learning_Text.git

Running the Tensor Flow model on local

To install docker: https://docs.docker.com/engine/install/

Download the saved tensorflow model from s3 to local.

The folder structure for the saved tensorflow models on S3 are as follows:

├── dataset20200101projectfiles
    ├── models
        ├── sentiment_model
            ├──1                                     <- first sentiment model for only positive and negetive  predictions
                ├──assets
                   ├──vocab.txt
                ├──keras_metadata.pb
                ├──saved_model.pb
                ├──variables
                   ├──variables.data-00000-of-00001
                   ├──variables.index
            ├──2                                      <- latest sentiment model for positive,negetive and neutral predictions
                ├──assets
                   ├──vocab.txt
                ├──keras_metadata.pb
                ├──saved_model.pb
                ├──variables
                   ├──variables.data-00000-of-00001
                   ├──variables.index

Create a models folder for the model on local
mkdir models
Download the model from the s3 location to the models folder on local
aws s3 cp s3://dataset20200101projectfiles/models/ . --recursive

docker run -p 8501:8501 --name sentiment_model --mount type=bind,source=./Deep_Learning_Text/deep_learning_DS/models/sentiment_model,target=/models/sentiment_model -e MODEL_NAME=sentiment_model -t tensorflow/serving:2.8.0

Moving the Tensor Flow model to an EC2 machine and runing it with docker

Login to AWS and create an EC2 instance.
The one that was used for the project was a Deep Learning AMI GPU TensorFlow 2.10.0 on Amazon Linux, which had most of the deep learning libraries and docker pre-installed.

Connect to the instance using the EC2 key-pair
ssh -i "path/.ssh/<ec2keypair>" ec2-user@<public dns>

Update the packages on the EC2 instance
-[ec2-user ~]$ sudo yum update -y
If Docker is not installed on the EC2 machine, follow the below installation steps:
[ec2-user ~]$ sudo yum install docker -y
Start the Docker Service on EC2
[ec2-user ~]$ sudo service docker start

Pull docker image for TensorFlow serving
sudo docker pull tensorflow/serving:2.8.0
(note: When trying to use the latest docker image-tensorflow/serving:latest - a memory error was encountered)

Grant permission to access s3 from ec2:
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-access-s3-bucket/

Copy the model from s3 to ec2
aws s3 cp s3://dataset20200101projectfiles/models/ . --recursive (Note: When trying to access the s3 location of the saved model,we encountered an error with tensorflow having access issues with S3)

Run the following Docker container run command , using 8501 as the port

docker run -p 8501:8501 --name sentiment_model --mount type=bind,source=//e//python_projects/Deep_Learning_Text/deep_learning_DS/models/sentiment_model,target=/models/sentiment_model -e MODEL_NAME=sentiment_model -t tensorflow/serving:2.8.0

Add port 8501 in the security group for inbound traffic from local machine

Install the AWS Kinesis consumer and producer

pip install -r requirements.txt

Modify the ./twitter_streaming/twitter_streaming.ini file as needed to update the following:
stream_filter ->set the filter on tweets to be processed
stream_language -> sets the tweet language to look for
file_max_tweet_limit -> maximum of tweets to be processed into a single json file
collect_max_tweet_limt -> maximum of tweets to be processed on a single run.

To execute Producer process
python ./twitter_streaming/twitter_streaming/producer/twitter_stream_message_producer.py

To execute Consumer process python ./twitter_streaming/twitter_streaming/consumer/twitter_stream_message_consumer.py The consumer process reads tweets from Kinesis and calls the tensorFlow api to make predictions and stores them on s3.

To run the dashboard using Streamlit

For installation steps for Streamlit on windows -> Streamlit installation For windows,in the Anaconda Navigator terminal that appears : streamlit run dashboard.py

Project Organization

├── LICENSE
├── README.md                  <- The top-level README for developers using this project.
├── visualization              <- Streamlist dashboard folder
├── Deep_Learning_DS           <- Python notebooks ,models folder
├── twitter_streaming          <- consumer and producer modules for reading from tweepy and writing to Kineses, S3
└── images                     <- images,diagrams for the project

Credits

https://ileriayo.github.io/markdown-badges/
https://docs.docker.com/engine/install/

License

License

About

The project predicts the sentiment ( positive ,negative, neutral ) on real-time tweets using Tweepy, AWS Kinesis Streaming, S3, Tensorflow Serving, BERT

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published