The project predicts the sentiment (positive, negative, or neutral) of real-time tweets, which are read from the Twitter API with Tweepy by a Python producer process. The tweets are pushed into a Kinesis data stream, from which a Python consumer process reads them and invokes the TensorFlow Serving API to predict the sentiment. The TensorFlow Serving application uses a BERT model with a classifier layer on top to make predictions. The predictions are displayed on a Streamlit dashboard to show trends. The Twitter topic used when pulling tweets can be configured, along with the number of tweets to pull at a time. The raw tweets and the predictions are stored on AWS S3 as JSON files.
- Process Flow
- Data Collection, Preparation
- Model Selection, Model Training
- Installation
- Project Organization
- Credits
- License
Datasets used:
- Sentiment140 Dataset Details
Source: http://help.sentiment140.com/for-students
Description: The training data was created automatically, as opposed to having humans manually annotate tweets. In the approach used, any tweet containing positive emoticons, like :), was labelled positive, and any tweet containing negative emoticons, like :(, was labelled negative. The tweets were collected through the Twitter Search API using keyword search. The approach is described in the following paper: https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf. The data is a CSV with the emoticons removed.
- Twitter Airline Sentiment Dataset
Source: https://www.kaggle.com/crowdflower/twitter-airline-sentiment
Description: This data originally came from Crowdflower's Data for Everyone library. As the original source describes it, this was a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize the negative reasons (such as "late flight" or "rude service").
The final model, trained in the following notebook, uses BERT plus a classification layer to predict positive, negative, and neutral sentiment, and was trained on the airline tweet sentiment dataset.
It uses the BERT model from TensorFlow Hub:
Notebook for Analysis using BERT + classifier for Positive, Negative, and Neutral sentiment
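A minimal sketch of how a BERT + classifier model of this kind can be assembled with TensorFlow Hub; the preprocessor/encoder handles, dropout rate, and optimizer settings below are illustrative assumptions, not necessarily the ones used in the notebook.

```python
# Sketch only: TF Hub handles, layer sizes, and training settings are assumptions.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 - registers the ops needed by the BERT preprocessor

PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

def build_model(num_classes: int = 3) -> tf.keras.Model:
    # Raw tweet strings go in; the preprocessor handles tokenization.
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
    preprocess = hub.KerasLayer(PREPROCESS_HANDLE, name="preprocessing")
    encoder = hub.KerasLayer(ENCODER_HANDLE, trainable=True, name="bert_encoder")

    encoder_outputs = encoder(preprocess(text_input))
    pooled = encoder_outputs["pooled_output"]  # [batch, 768] sentence embedding
    x = tf.keras.layers.Dropout(0.1)(pooled)
    probs = tf.keras.layers.Dense(num_classes, activation="softmax", name="classifier")(x)
    return tf.keras.Model(text_input, probs)

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```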
Other models that were considered are as follows:
- The previous model, trained in the following notebook, uses BERT plus a classification layer to predict positive and negative sentiment, and was trained on the Sentiment140 dataset:
Notebook for initial Analysis -
Analysis and model selection using TensorFlow (multiple ways of reading input)
Notebook for Analysis using BERT + classifier for Positive and Negative sentiment -
Analysis using a Hugging Face Transformers model with TensorFlow:
Notebook for Hugging Face for Positive and Negative sentiment -
Analysis using a Hugging Face Transformers model with TensorFlow on Google Colab:
Colab Notebook for Hugging Face for Positive and Negative sentiment -
Analysis using Doc2Vec with gensim
Notebook for Analysis using doc2vec
Clone the repository
git clone git@github.com:agvar/Deep_Learning_Text.git
To install Docker: https://docs.docker.com/engine/install/
The folder structure for the saved TensorFlow models on S3 is as follows:
├── dataset20200101projectfiles
    ├── models
        ├── sentiment_model
            ├── 1                <- first sentiment model: positive and negative predictions only
            │   ├── assets
            │   │   └── vocab.txt
            │   ├── keras_metadata.pb
            │   ├── saved_model.pb
            │   └── variables
            │       ├── variables.data-00000-of-00001
            │       └── variables.index
            └── 2                <- latest sentiment model: positive, negative, and neutral predictions
                ├── assets
                │   └── vocab.txt
                ├── keras_metadata.pb
                ├── saved_model.pb
                └── variables
                    ├── variables.data-00000-of-00001
                    └── variables.index
Create a models folder for the model on the local machine
mkdir models
Download the model from the S3 location to the local models folder
aws s3 cp s3://dataset20200101projectfiles/models/ . --recursive
docker run -p 8501:8501 --name sentiment_model --mount type=bind,source="$(pwd)"/Deep_Learning_Text/deep_learning_DS/models/sentiment_model,target=/models/sentiment_model -e MODEL_NAME=sentiment_model -t tensorflow/serving:2.8.0
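Once the container is running, the REST endpoint can be smoke-tested with a short request. Whether the serving signature accepts raw strings under "instances", and the order of the output classes, depend on how the model was exported, so treat the payload and comment below as assumptions.

```python
# Quick smoke test against the local TensorFlow Serving container.
# Assumption: the exported serving signature accepts raw tweet strings.
import requests

url = "http://localhost:8501/v1/models/sentiment_model:predict"
payload = {"instances": ["The flight was delayed for three hours, terrible experience."]}

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"predictions": [[...class probabilities...]]}
```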
Login to AWS and create an EC2 instance.
The one that was used for the project was a Deep Learning AMI GPU TensorFlow 2.10.0 on Amazon Linux, which had most of the deep learning libraries and docker pre-installed.
Connect to the instance using the EC2 key-pair
ssh -i "path/.ssh/<ec2keypair>" ec2-user@<public dns>
Update the packages on the EC2 instance
[ec2-user ~]$ sudo yum update -y
If Docker is not installed on the EC2 machine, follow the below installation steps:
[ec2-user ~]$ sudo yum install docker -y
Start the Docker Service on EC2
[ec2-user ~]$ sudo service docker start
Pull the Docker image for TensorFlow Serving
sudo docker pull tensorflow/serving:2.8.0
(Note: when trying to use the latest Docker image, tensorflow/serving:latest, a memory error was encountered.)
Grant permission to access S3 from EC2:
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-access-s3-bucket/
Copy the model from S3 to EC2
aws s3 cp s3://dataset20200101projectfiles/models/ . --recursive
(Note: when TensorFlow Serving was pointed directly at the S3 location of the saved model, it ran into S3 access issues, so the model is copied to the instance first.)
Run the following docker run command, using 8501 as the port
docker run -p 8501:8501 --name sentiment_model --mount type=bind,source=//e//python_projects/Deep_Learning_Text/deep_learning_DS/models/sentiment_model,target=/models/sentiment_model -e MODEL_NAME=sentiment_model -t tensorflow/serving:2.8.0
Add port 8501 to the security group to allow inbound traffic from the local machine
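To confirm the instance is reachable once the security group rule is in place, the TensorFlow Serving model status endpoint can be queried from the local machine (the DNS value below is a placeholder to replace with the instance's public DNS):

```python
# Check that TensorFlow Serving on the EC2 instance is reachable from outside.
import requests

EC2_PUBLIC_DNS = "<ec2 public dns>"  # replace with the instance's public DNS

status = requests.get(f"http://{EC2_PUBLIC_DNS}:8501/v1/models/sentiment_model")
print(status.json())  # lists the loaded model versions and their state (e.g. "AVAILABLE")
```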
pip install -r requirements.txt
Modify the ./twitter_streaming/twitter_streaming.ini file as needed to update the following:
- stream_filter -> sets the filter on tweets to be processed
- stream_language -> sets the tweet language to look for
- file_max_tweet_limit -> maximum number of tweets written to a single JSON file
- collect_max_tweet_limt -> maximum number of tweets processed in a single run
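A hypothetical twitter_streaming.ini illustrating these keys (the section name and values are placeholders, not the ones shipped with the repository):

```ini
; Illustrative example only - adapt the file in the repository rather than replacing it.
[twitter_stream]
stream_filter = airline
stream_language = en
file_max_tweet_limit = 100
collect_max_tweet_limt = 1000
```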
To execute the producer process
python ./twitter_streaming/twitter_streaming/producer/twitter_stream_message_producer.py
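The gist of the producer is sketched below under a few assumptions: a Tweepy 4.x streaming client, a Kinesis stream named twitter-stream, a hard-coded filter rule, and a placeholder bearer token; the actual script reads these settings from twitter_streaming.ini.

```python
# Sketch only: stream name, region, bearer token, and filter rule are placeholders.
import json
import boto3
import tweepy

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is a placeholder

class TweetProducer(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Push each incoming tweet onto the Kinesis data stream as JSON.
        record = {"id": tweet.id, "text": tweet.text}
        kinesis.put_record(
            StreamName="twitter-stream",          # placeholder stream name
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=str(tweet.id),
        )

producer = TweetProducer(bearer_token="<bearer token>")
producer.add_rules(tweepy.StreamRule("airline lang:en"))  # filter comes from the .ini in the real script
producer.filter()
```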
To execute the consumer process
python ./twitter_streaming/twitter_streaming/consumer/twitter_stream_message_consumer.py
The consumer process reads tweets from Kinesis, calls the TensorFlow Serving API to make predictions, and stores the results on S3.
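A minimal sketch of that consumer loop, assuming a single-shard stream named twitter-stream, a predictions/ key prefix in the project bucket, and a served model that accepts raw strings; the real consumer reads its settings from twitter_streaming.ini and handles multiple shards and batching.

```python
# Sketch only: stream name, region, key prefix, and serving host are placeholders.
import json
import boto3
import requests

kinesis = boto3.client("kinesis", region_name="us-east-1")
s3 = boto3.client("s3")
TF_SERVING_URL = "http://<tf serving host>:8501/v1/models/sentiment_model:predict"

# Read from the first shard of the stream (the real consumer handles every shard).
shard_id = kinesis.describe_stream(StreamName="twitter-stream")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="twitter-stream", ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    iterator = batch["NextShardIterator"]
    for record in batch["Records"]:
        tweet = json.loads(record["Data"])
        # Ask TensorFlow Serving for the sentiment of the tweet text.
        prediction = requests.post(
            TF_SERVING_URL, json={"instances": [tweet["text"]]}
        ).json()["predictions"][0]
        # Store the raw tweet and its prediction on S3 as a JSON file.
        s3.put_object(
            Bucket="dataset20200101projectfiles",          # bucket name from the S3 paths above
            Key=f"predictions/{tweet['id']}.json",         # key prefix is an assumption
            Body=json.dumps({"tweet": tweet, "prediction": prediction}).encode("utf-8"),
        )
```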
For Streamlit installation steps on Windows, see: Streamlit installation
On Windows, in the Anaconda Navigator terminal that appears:
streamlit run dashboard.py
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── visualization <- Streamlit dashboard folder
├── Deep_Learning_DS <- Python notebooks and models folder
├── twitter_streaming <- consumer and producer modules for reading from Tweepy and writing to Kinesis, S3
└── images <- images,diagrams for the project
https://ileriayo.github.io/markdown-badges/
https://docs.docker.com/engine/install/