October 17th, 2016
Creating a Binary Classifier to Sort Trump vs. Clinton Tweets Using NLPRSS Share Category: Community, Data Journalism, Flow, Python
The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?
The Solution: Yes! We’ll divide this tutorial into three parts, the first on how to gather the necessary data, the second on data exploration, munging, & feature engineering, and the third on building our model itself. You can find all of our code on GitHub (https://git.io/vPwxr).
Part One: Collecting the Data
Note: We are going to be using Python. For the R version of this process, the concepts translate, and we have some code on Github that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository: https://git.io/vPwxr.
We used the Twitter API to collect tweets by both presidential candidates, which would become our dataset. Twitter only lets you access the latest ~3000 or so tweets from a particular handle, even though they keep all the Tweets in their own databases.
The first step is to create an app on Twitter, which you can do by visiting https://apps.twitter.com/. After completing the form you can access your app, and your keys and tokens. Specifically we’re looking for four things: the client key and secret (called consumer key and consumer secret) and the resource owner key and secret (called access token and access token secret).
We save this information in JSON format in a separate file
Then, we can use the Python libraries Requests and Pandas to gather the tweets into a DataFrame. We only really care about three things: the author of the Tweet (Donald Trump or Hillary Clinton), the text of the Tweet, and the unique identifier of the Tweet, but we can take in as much other data as we want (for the sake of data exploration, we also included the timestamp of each Tweet).
Once we have all this information, we can output it to a .csv file for further analysis and exploration.
Part Two: Data Cleaning and Munging
You can find the notebook for this part as “NLPAnalysis.ipynb” in our GitHub repository: https://git.io/vPwxr.
To fully take advantage of machine learning, we need to add features to this dataset. For example, we might want to take into account the punctuation that each Twitter account uses, thinking that it might be important in helping us discriminate between Trump and Clinton. If we take the amount of punctuation symbols in each Tweet, and take the average across all Tweets, we get the following graph:
Or perhaps we care about how many hashtags or mentions each account uses:
With our timestamp data, we can examine Tweets by their Retweet count, over time:
The tall blue skyscraper was Clinton’s “Delete Your Account” Tweet
We can also compare the distribution of Tweets over time. We can see that Clinton tweets more frequently than Trump (this is also evidenced by us being able to access older Tweets from Trump, since there’s a hard limit on the number of Tweets we can access).
The Democratic National Convention was in session from July 25th to the 28th
We can construct heatmaps of when these candidates were posting:
Heatmap of Trump Tweets, by day and hour
All this light analysis was useful for intuition, but our real goal is to only use the text of the tweet (including derived features) for our classification. If we included features like the time-stamp, it would become a lot easier.
We can utilize a process called tokenization, which lets us create features from the words in our text. To understand why this is useful, let’s pretend to only care about the mentions (for example, @h2oai) in each tweet. We would expect that Donald Trump would mention certain people (@GovPenceIN) more than others and certainly different people than Hillary Clinton. Of course, there might be people both parties tweet at (maybe @POTUS). These patterns could be useful in classifying Tweets.
Now, we can apply that same line of thinking to words. To make sure that we are only including valuable words, we can exclude stop-words which are filler words, such as ‘and’ or ‘the.’ We can also use a metric called term frequency – inverse document frequency (TF-IDF) that computes how important a word is to a document.
There are also other ways to use and combine NLP. One approach might be sentiment analysis, where we interpret a tweet to be positive or negative. David Robinson did this to show that Trump’s personal tweets are angrier, as opposed to those written by his staff.
Another approach might be to create word trees that represent sentence structure. Once each tweet has been represented in this format you can examine metrics such as tree length or number of nodes, which are measures of the complexity of a sentence. Maybe Trump tweets a lot of clauses, as opposed to full sentences.
Part Three: Building, Training, and Testing the Model
You can find the notebooks for this part as “Python-TF-IDF.ipynb” and “TweetsNLP.flow” in our GitHub repository: https://git.io/vPwxr.
There were a lot of approaches to take but we decided to keep it simple for now by only using TF-IDF vectorization. The actual code writing was relatively simple thanks to the excellent Scikit-Learn package alongside NLTK.
We could have also done some further cleaning of the data, such as excluding urls from our Tweets text (right now, strings such as “zy7vpfrsdz” get their own feature column as the NLTK vectorizer treats them as words). Our not doing this won’t affect our model as the urls are unique, but it might save on space and time. Another strategy could be to stem words, treating words as their root (so ‘hearing’ and ‘heard’ would both be coded as ‘hear’).
Still, our model (created using H2O Flow) produces quite a good result without those improvements. We can use a variety of metrics to confirm this, including the Area Under the Curve (AUC). The AUC measures the True Positive Rate (tpr) versus the False Negative Rate (fpr). A score of 0.5 means that the model is equivalent to flipping a coin, and a score of 1 means that the model is 100% accurate.
The model curve is blue, while the red curve represents 50–50 guessing
For a more intuitive judgement of our model we can look at the variable importances of our model (what the model considers to be good discriminators of the data) and see if they make sense:
Can you guess which words (variables) correspond (are important) to which candidate?
Maybe the next step could be to build an app that will take in text and output if the text is more likely to have come from Clinton or Trump. Perhaps we can even consider the Tweets of several politicians, assign them a ‘liberal/conservative’ score, and then build a model to predict if a Tweet is more conservative or more liberal (important features would maybe include “Benghazi” or “climate change”). Another cool application might be a deep learning model, in the footsteps of @DeepDrumpf.
If this inspired you to create analysis or build models, please let us know! We might want to highlight your project ??.