January 31, 2018

Sentiment Analysis on Streaming Data

This is the second article in a series of blog posts that demonstrate the power of Xcalar Data Platform. In my first article, Part 1, I used publicly available San Francisco employee salary data for ad hoc analytics.

For this exercise, I showcase Xcalar’s ability to work with streaming data by performing a sentiment analysis on Twitter data to gauge the public’s reaction to an Emmy Award show.

For this blog post, I worked with Twitter data, using Twitter's public API (https://developer.twitter.com/en/docs), and called a natural language processing (NLP) API named Dandelion (https://dandelion.eu/) from Python. Dandelion offers a variety of text analytics services, including text classification, semantic similarity, and sentiment analysis. I used the sentiment analysis service. Because the basic Dandelion plan allows processing of only up to 1000 units per day, I worked with only 99 tweets for this blog post.

Twitter provides two main types of APIs:
  1. Streaming API: This API allows for a live stream of data from Twitter. It is useful for performing real-time analytics on Twitter data while an event is occurring in the real world.
  2. REST API: This API provides access to read and write Twitter data. It is useful for reading past tweets/user data and creating new tweets.
I used Tweepy, a Python library for accessing the Twitter REST API: https://github.com/tweepy/tweepy

SET UP

Before I could start, I had to create the following:
  1. A Twitter Dev account https://developer.twitter.com/en/community
  2. A new Twitter app at https://apps.twitter.com/
  3. A Dandelion account https://dandelion.eu/accounts/register/?next=/

STEP 1: IMPORTING TWEETS INTO XCALAR

The following Python script uses Tweepy to query for tweets containing the words Emmy Award. Authentication keys and secrets from both your Twitter dev account and App are necessary to access Twitter's API.
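The original script is not reproduced here, so what follows is only a minimal sketch of what such an import step might look like, not the author's code. It assumes Tweepy's OAuth 1.0a interface (`tweepy.OAuthHandler`, `tweepy.API`) and a search call named `search` in older Tweepy releases or `search_tweets` in newer ones. The function names `tweet_to_row`, `fetch_tweets`, and `build_api` are hypothetical, as is the choice of columns to keep.

```python
# Columns kept from each tweet; they mirror attribute names on a Tweepy Status object.
COLUMNS = ("id", "created_at", "text", "favorite_count", "retweet_count")


def tweet_to_row(status):
    """Flatten one tweet object into a plain dict, one key per column."""
    row = {name: getattr(status, name, None) for name in COLUMNS}
    # Timestamps are converted to strings so every value is a simple scalar.
    if row["created_at"] is not None:
        row["created_at"] = str(row["created_at"])
    return row


def fetch_tweets(api, query="Emmy Award", count=99):
    """Yield one flattened row per tweet matching the query.

    `api` is assumed to be an authenticated tweepy.API instance. Older Tweepy
    releases expose the search call as `search`, newer ones as `search_tweets`.
    """
    search = getattr(api, "search_tweets", None) or getattr(api, "search")
    for status in search(q=query, count=count):
        yield tweet_to_row(status)


def build_api(consumer_key, consumer_secret, access_token, access_secret):
    """Create an authenticated client from the app's keys and secrets."""
    import tweepy  # imported lazily so the helpers above work without Tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)
```

With real credentials, `list(fetch_tweets(build_api(...)))` would return a list of dicts, one per tweet. Yielding one dict per record keeps the sketch independent of how the rows are later loaded into a table.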
After creating an import user-defined function (UDF) in Xcalar’s integrated Jupyter Notebook editor, I copied it into a new module called twitterpy and ran the script during dataset import in Xcalar Design:
Figure 1: Calling Twitter REST API during dataset import
The following table was created:
Figure 2: Table produced by importing Twitter feeds
Each tweet is split into several columns, which can be useful for gaining other insights:
  1. id
  2. created_at
  3. text
  4. favorite_count
  5. retweet_count

STEP 2: USING DANDELION

Now that the data is imported into Xcalar, let's use Dandelion to perform a sentiment analysis.

The following Python script calls the Dandelion API with an input string and returns a sentiment value in the range [-1, 1], where:
  1. -1 is an extremely negative sentiment
  2. 0 is neutral
  3. 1 is strongly positive
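As the original script is not reproduced here, the following is a minimal sketch of such a call using only the Python standard library. The endpoint URL, the `text`, `lang`, and `token` query parameters, and the `{"sentiment": {"score": ...}}` response shape are assumptions based on Dandelion's public documentation and should be checked against the current API reference. The function names `parse_sentiment` and `get_sentiment` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of Dandelion's sentiment service; verify against the API docs.
SENTIMENT_URL = "https://api.dandelion.eu/datatxt/sent/v1"


def parse_sentiment(payload):
    """Extract the score from a decoded response.

    The payload is assumed to look like
    {"sentiment": {"score": 0.7, "type": "positive"}}.
    """
    score = float(payload["sentiment"]["score"])
    # Clamp defensively so later steps can rely on the [-1, 1] range.
    return max(-1.0, min(1.0, score))


def get_sentiment(text, token, lang="en"):
    """Send one string to the sentiment service and return its score."""
    query = urllib.parse.urlencode({"text": text, "lang": lang, "token": token})
    with urllib.request.urlopen(SENTIMENT_URL + "?" + query, timeout=10) as response:
        return parse_sentiment(json.load(response))
```

Separating the parsing from the network call makes it possible to check the score handling without spending any of the daily API units.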
Similar to the import UDF, I created a Map UDF from this Python script and copied it into the dandelionpy module.

I then ran Xcalar’s Map operation to apply the Map UDF to the text column in the tweets table, as shown in Figure 3:
Figure 3: Running my Dandelion API call through Map
A new column called sentiment was created, as shown in Figure 4 below:
Figure 4: Dandelion sentiment scores

STEP 3: VISUALIZING THE DATA

I used the Profile option on the sentiment column to display a bar graph along with some statistics regarding the data:
Figure 5: Profile of the sentiment column
Performing a few quick filter operations shows that there were 52 positive tweets (score > 0), 35 neutral tweets (score = 0), and 12 negative tweets (score < 0).

The average sentiment score of the 99 tweets pulled from Twitter is 0.289, a slightly positive score.
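The same counts and average can be reproduced outside Xcalar by splitting the scores by sign and taking their mean. The sketch below is illustrative only: the function name `summarize` is hypothetical, and the sample scores are made up rather than taken from the tweets analyzed here.

```python
from statistics import mean


def summarize(scores):
    """Count positive, neutral, and negative scores and compute the mean."""
    return {
        "positive": sum(1 for s in scores if s > 0),
        "neutral": sum(1 for s in scores if s == 0),
        "negative": sum(1 for s in scores if s < 0),
        "average": mean(scores),
    }


# Made-up scores for illustration only; these are not the tweets analyzed above.
sample = [0.5, 0.0, -0.25, 1.0, 0.0]
print(summarize(sample))
```

For this sample, the summary reports two positive, two neutral, and one negative score.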

CONCLUSION

  1. Based on the analysis performed on the data set, the majority of the tweets contained positive language, which suggests a positive reaction to the Emmy Award show.
  2. Setting up Tweepy and Dandelion and using them with Xcalar was a fairly straightforward process and opens the door to many possibilities.
It is important to note that while Dandelion's sentiment analysis is mostly correct in its interpretation of the positive or negative nature of language, it can misinterpret some sarcasm as positive language. Since our small tweet sample contained very little sarcastic language, the results can be considered reasonably accurate.
This concludes my second Xcalar blog entry. Stay tuned for more complex examples to come.
For more detailed instructions on how to do the tasks in Xcalar Design that are described in this blog, see the Xcalar Design Help documentation.
Dan Kaper