A few weeks back at ODSC (Open Data Science Conference) in SF, a lot of folks asked us if we could help them build and operationalize ML to perform sentiment analysis on social data, product review data, and customer satisfaction data.
Why should our customers have all the fun? We decided to build and run our own Sentiment Analysis Dataflow on Impeachment Twitter (There is probably no more controversial or polarizing a topic than impeachment).
STEP 1: CAPTURE TWEETS
We wrote a simple Python script that pulls tweets from the Twitter API. The resulting tweets are stored in JSON form and collected in Amazon S3 buckets (a directory for each of the 9 days of this experiment).
Over the 9-10 days of this experiment, we captured and processed 3.5 Million Tweets.
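The capture script itself is not shown in this post, so here is only a minimal sketch of the "one directory per day" layout. The key prefix, file naming, and helper names are hypothetical. The upload helper assumes boto3 and AWS credentials are available and uses the standard `put_object` call.

```python
import json
from datetime import date


def day_key(prefix: str, day: date, part: int) -> str:
    """Build an S3 object key with one 'directory' per capture day (hypothetical layout)."""
    return f"{prefix}/{day.isoformat()}/tweets-{part:05d}.json"


def to_json_lines(tweets: list) -> str:
    """Serialize a batch of tweet dicts as newline-delimited JSON."""
    return "\n".join(json.dumps(t, ensure_ascii=False) for t in tweets)


def upload_batch(bucket: str, key: str, tweets: list) -> None:
    """Upload one batch to S3. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=to_json_lines(tweets).encode("utf-8"))
```

Keeping each day under its own prefix is what lets the same dataflow be pointed at one day's data at a time.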
STEP 2: PRE-PROCESSING
If only the world's data were standardized and could read our minds. But until such a technology and company are funded, we need to prepare our tweets before we can apply any NLP sentiment analysis. A screenshot of the dataflow that executes this preparation is shown below.
The 8 pre-processing steps that we performed are as follows:
1) Rename Nested Location
The Tweet location was a nested field within the JSON file. Using Xcalar, we expanded the JSON file, picked out the "location" field and expressed it within all the JSON records.
After expressing it, we renamed it from user.location to just "location".
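We did this step inside Xcalar, but the idea is easy to show in plain Python. The sketch below is an analogue, not the Xcalar operation itself; the function name and sample record are made up for illustration.

```python
import json


def promote_location(record: dict) -> dict:
    """Copy the nested user.location value to a top-level 'location' field."""
    flat = dict(record)  # shallow copy so the input record is not modified
    user = record.get("user") or {}
    flat["location"] = user.get("location")  # None when the user set no location
    return flat


# Illustrative record, not real captured data.
raw = '{"text": "example tweet", "user": {"location": "Portland, OR"}}'
print(promote_location(json.loads(raw))["location"])
```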
2) Strip Out Day of Week
Twitter captures the tweet's created date with the day of the week at the start. We wanted to strip that day of the week out for date/time simplicity.
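A minimal sketch of this step, assuming the standard Twitter API v1.1 `created_at` layout (for example `Wed Nov 20 18:05:09 +0000 2019`). The sample value, the function name, and the output format are illustrative choices, not taken from our dataflow.

```python
from datetime import datetime

# Layout of created_at in the Twitter v1.1 API: weekday, month, day, time, UTC offset, year.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def strip_day_of_week(created_at: str) -> str:
    """Parse a Twitter created_at string and re-emit it without the weekday."""
    parsed = datetime.strptime(created_at, TWITTER_FORMAT)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(strip_day_of_week("Wed Nov 20 18:05:09 +0000 2019"))
```

Parsing into a real datetime, rather than slicing off the first four characters, also gives a sortable timestamp for the time-bucket averages later on.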
3) Keep Only Needed Fields and Take out Null Values
For this project, we only needed three key fields: tweet, tweet_date, and location. We used a SQL statement to keep only these three fields and to filter out all tweets that were blank.
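The SQL for this step might look roughly like the query below. It is run here against an in-memory SQLite table so the sketch is self-contained; the table name, the extra `lang` column, and the sample rows are assumptions for illustration, and Xcalar's SQL dialect may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet TEXT, tweet_date TEXT, location TEXT, lang TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?, ?)",
    [
        ("Hearings start today", "2019-11-20 18:05:09", "Texas", "en"),
        (None, "2019-11-20 18:06:00", "Ohio", "en"),   # null tweet: dropped
        ("   ", "2019-11-20 18:07:00", "Maine", "en"),  # blank tweet: dropped
    ],
)

# Keep only the three fields we need and discard null or blank tweets.
KEEP_SQL = """
SELECT tweet, tweet_date, location
FROM tweets
WHERE tweet IS NOT NULL AND TRIM(tweet) <> ''
"""
rows = conn.execute(KEEP_SQL).fetchall()
print(rows)
```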
4) Determine if the Tweet Is a Retweet + 5) Remove Retweets
One of the advantages of Xcalar is its ability to build your dataflows on the actual big data set. Why is this important? Because you can see anomalies in the data immediately and can adjust the dataflow. In this case, we noticed that there were a lot of duplicate retweets in the data set. We ran a quick profile of the data.
And we discovered that of the 178,000 tweets, 83% were retweets and NOT original commentary. We then made a decision that we didn't want to have artificial and lazy sentiment amplification, and so we filtered out all tweets that were retweets.
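Steps 4 and 5 can be sketched as a flag, a profile, and a filter. The post does not say how retweets were detected; the sketch assumes the common convention that a retweet's text starts with `RT @`, and it assumes the text lives in the `tweet` field kept in step 3. The function names are hypothetical.

```python
def is_retweet(row: dict) -> bool:
    """Flag a row as a retweet (assumption: retweet text begins with 'RT @')."""
    return (row.get("tweet") or "").startswith("RT @")


def retweet_share(rows: list) -> float:
    """Fraction of rows that are retweets; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(is_retweet(r) for r in rows) / len(rows)


def drop_retweets(rows: list) -> list:
    """Keep only original commentary."""
    return [r for r in rows if not is_retweet(r)]


sample = [
    {"tweet": "RT @someone: big news today"},
    {"tweet": "My own take on the hearings"},
]
print(retweet_share(sample), len(drop_retweets(sample)))
```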
6,7,8) Determine PRO/ANTI Lexicon Count and only Apply NLP to Those with at Least 1 Word in Either Lexicon
Sentiment Analysis simply determines whether the English text is POSITIVE, NEGATIVE, NEUTRAL or MIXED and gives a numerical rating, but it doesn't have any context for how the result will be used. In the interest of time and simplicity, we wanted to apply a lexicon-based classification to the resulting sentiment.
To do so, we crafted two lexicons:
- words associated with PRO-IMPEACHMENT sentiment
- words associated with ANTI-IMPEACHMENT
But we didn't want to hard-code these lists, because over the course of the project we added, removed, and changed words as events played out.
So to make this as flexible as possible, we implemented the PRO-IMPEACHMENT and ANTI-IMPEACHMENT lexicon as parameters within Xcalar.
Once we had these two lexicons, we wrote 2 short Python functions to count how many words from each lexicon appeared in the tweet.
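A minimal sketch of such a counting function. It assumes each lexicon arrives as a comma-separated string parameter and that matching is case-insensitive on whole words; both are assumptions, and the lexicon words below are placeholders rather than our actual lists.

```python
import re


def parse_lexicon(param: str) -> set:
    """Turn a comma-separated parameter string into a set of lowercase words."""
    return {w.strip().lower() for w in param.split(",") if w.strip()}


def count_lexicon_words(tweet: str, lexicon: set) -> int:
    """Count how many word tokens in the tweet appear in the lexicon."""
    tokens = re.findall(r"[a-z0-9_']+", tweet.lower())
    return sum(1 for tok in tokens if tok in lexicon)


# Placeholder parameter values for illustration only.
PRO_LEXICON = parse_lexicon("pelosi, schiff")
ANTI_LEXICON = parse_lexicon("trump, giuliani")

text = "I love Nancy Pelosi"
print(count_lexicon_words(text, PRO_LEXICON), count_lexicon_words(text, ANTI_LEXICON))
```

Because the lexicons are parsed from parameters at run time, changing a word list does not require touching the function.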
STEP 3: AMAZON COMPREHEND
While many of our customers use scikit-learn and NLTK for ML and sentiment analysis, we wanted something simpler - we just needed basic English-language sentiment analysis, and who better to do such language processing than the makers of Amazon Alexa?
We knew that Amazon had a much larger training data set, so we didn't want to reinvent the wheel, and so we chose to use Amazon Comprehend which is a natural language processing (NLP) service that uses machine learning to determine sentiment.
Three lines of code were all we needed to access the Comprehend web service and get back POSITIVE, NEGATIVE, NEUTRAL and MIXED sentiment results.
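For readers who have not used the service, a call looks roughly like the sketch below. This is not our original code: it uses boto3's `detect_sentiment` with a `Text` and `LanguageCode`, and the wrapper names, the region default, and the flattened output shape are assumptions. Running it for real requires boto3 and AWS credentials.

```python
def make_client(region: str = "us-east-1"):
    """Create a Comprehend client (assumed region; needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so tweet_sentiment can be tested with a stub client

    return boto3.client("comprehend", region_name=region)


def tweet_sentiment(client, text: str) -> dict:
    """Return the sentiment label plus the four confidence scores for one tweet."""
    response = client.detect_sentiment(Text=text, LanguageCode="en")
    scores = response["SentimentScore"]
    return {
        "sentiment": response["Sentiment"],  # POSITIVE, NEGATIVE, NEUTRAL or MIXED
        "positive": scores["Positive"],
        "negative": scores["Negative"],
        "neutral": scores["Neutral"],
        "mixed": scores["Mixed"],
    }
```

Typical usage would be `tweet_sentiment(make_client(), "some tweet text")`, applied once per row of the dataflow.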
CAVEAT: While the Amazon Comprehend API was super convenient and easy to implement, it was also VERY slow to execute: about 1 minute for every 1,000 records.
Below is a sample of the sentiment that resulted from Amazon Comprehend. You'll see that it does a pretty accurate job of determining sentiment. But as you can see from the red and blue highlighted tweets, it doesn't do a great job of providing context. And this is why we need to do POST-PROCESSING.
This concludes the pre-processing + NLP dataflow. We ran it 9 times - once for each S3 bucket that contains the tweets from that day.
Now that we have run Sentiment Analysis on all the tweets, we need to run post-processing to achieve three primary things:
- Apply Lexicon classification to the Positive and Negative sentiment (this is the MOST important)
- Calculate averages over the 3-hour time segments
- Process the data for Tableau (clean-it up and filter to a manageable data set size)
STEP 4: POST-PROCESSING - Lexicon classification
After combining the results from all 9 days, we need to start classifying the sentiments using the ANTI and PRO words. The words associated with the ANTI and PRO Impeachment sides were passed in via parameters and included the words in the following word salads.
Without applying a lexicon weight, the raw sentiments are misleading. For example, when we sorted for the most positive tweets, the top two results were both 99% positive, but they could not have been more opposite in opinion.
So what do we do?
We apply lexicon classification to the sentiment.
For PRO IMPEACHMENT classification:
- POSITIVE SENTIMENT * # of words in PRO IMPEACHMENT word salad +
("I love Nancy Pelosi")
- NEGATIVE SENTIMENT * # of words in ANTI IMPEACHMENT word salad
("I hate Donald Trump")
- and put it into a PRO IMPEACHMENT field
For ANTI IMPEACHMENT classification:
- POSITIVE SENTIMENT * # of words in ANTI IMPEACHMENT word salad +
("I love Donald Trump")
- NEGATIVE SENTIMENT * # of words in PRO IMPEACHMENT word salad
("I hate Nancy Pelosi")
- and put it into an ANTI IMPEACHMENT field
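The two formulas above can be sketched in a few lines of Python. The function and argument names are hypothetical; `positive` and `negative` stand for the sentiment scores, and the two counts are the per-tweet lexicon word counts from pre-processing.

```python
def lexicon_scores(positive: float, negative: float, pro_count: int, anti_count: int) -> tuple:
    """Return (pro_impeachment, anti_impeachment) scores for one tweet.

    Praise for the PRO side or criticism of the ANTI side counts as PRO;
    praise for the ANTI side or criticism of the PRO side counts as ANTI.
    """
    pro_impeachment = positive * pro_count + negative * anti_count
    anti_impeachment = positive * anti_count + negative * pro_count
    return pro_impeachment, anti_impeachment


# "I love Nancy Pelosi": strongly positive, one PRO lexicon word.
print(lexicon_scores(positive=0.99, negative=0.0, pro_count=1, anti_count=0))
# "I hate Donald Trump": strongly negative, one ANTI lexicon word.
print(lexicon_scores(positive=0.0, negative=0.99, pro_count=0, anti_count=1))
```

Both sample tweets land in the PRO IMPEACHMENT field even though one is positive and one is negative, which is exactly the context that raw sentiment lacks.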
You can argue our lexicon classification, but you can't deny that it works!
Now when we sort for the most PRO IMPEACHMENT tweets, we get this result. Note that the tweets are now correctly classified as PRO IMPEACHMENT.
We can do the same thing and sort by the most ANTI IMPEACHMENT sentiment.
STEP 4: POST-PROCESSING - preparing for Tableau visualization
The final step of this analysis was to visualize it within Tableau. But before we could connect to Tableau, we needed to do three things:
- Aggregate the data by 3-hour time segments and calculate the average PRO IMPEACHMENT and ANTI IMPEACHMENT sentiment.
- Clean-up location so that Tableau can read the data (Tableau doesn't like emojis in location)
- Filter the data so that there are no more than 50,000 rows (Tableau works better when the size of data is under 100,000 rows)
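The three preparation steps above can be sketched as follows. The field names (`tweet_date`, `pro`, `anti`), the timestamp format, and the simple truncation used to cap the row count are assumptions; the post does not specify how rows were chosen for the 50,000-row limit. Note that the location cleaner is deliberately blunt: it drops every non-ASCII character, so accented letters are removed along with emojis.

```python
from collections import defaultdict
from datetime import datetime


def three_hour_bucket(ts: str) -> str:
    """Round a 'YYYY-MM-DD HH:MM:SS' timestamp down to its 3-hour segment."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    start = dt.replace(hour=dt.hour - dt.hour % 3, minute=0, second=0)
    return start.strftime("%Y-%m-%d %H:%M")


def average_by_bucket(rows: list) -> dict:
    """Average the pro and anti scores per 3-hour segment."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # pro total, anti total, row count
    for row in rows:
        acc = sums[three_hour_bucket(row["tweet_date"])]
        acc[0] += row["pro"]
        acc[1] += row["anti"]
        acc[2] += 1
    return {bucket: (p / n, a / n) for bucket, (p, a, n) in sorted(sums.items())}


def clean_location(location: str) -> str:
    """Drop non-ASCII characters (emojis included) and trim whitespace."""
    return location.encode("ascii", "ignore").decode("ascii").strip()


def cap_rows(rows: list, limit: int = 50_000) -> list:
    """Keep at most `limit` rows (assumption: plain truncation)."""
    return rows[:limit]
```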
STEP 5: Visualization
We directly connected the results of our sentiment analysis to Tableau for visualization. Here are the results.
These are the tweets plotted by location for November 20. The red circles represent strong ANTI-IMPEACHMENT sentiment, while the blue circles represent strong PRO-IMPEACHMENT sentiment. The various shades of light blue/red represent areas where the sentiment was not as strong or was more neutral. The distribution of red and blue dots is for the most part what we expected: strong PRO-IMPEACHMENT in San Francisco, LA and NY, and ANTI-IMPEACHMENT in the Midwest, Florida and Texas. The odd one out was the strong ANTI-IMPEACHMENT in Portland, Oregon!
Here we plot the average PRO and ANTI IMPEACHMENT sentiment over the course of the last 10 days.
This was a pretty ambitious webinar project - but we were confident going in that it was doable in the short time frame that we had. The reason was that we used ONE SINGLE PLATFORM for the entire thing, soup-to-nuts:
- data ingestion (virtually from S3)
- data pre-processing for ML
- executing of Sentiment ML within the dataflow
- post-processing of lexicon classification
- direct data access to Tableau
Nowhere in this process did we have to stitch together another tool - this is impossible outside of Xcalar. Even within the broad Amazon services ecosystem, there is quite substantial interim data staging and importing of data between products.
So you made it this far in this super long blog post! We are impressed and would love to set up your own instance where you can see how we did this and try it for yourself. Drop us a note at firstname.lastname@example.org!