August 5th, 2019

Detecting Sarcasm is difficult, but AI may have an answer

Category: Driverless AI, NLP, Recipes, Technical, Tutorials

Recently, while shopping for a laptop bag, I stumbled upon a pretty amusing customer review:

“This is the best laptop bag ever. It is so good that within two months of use, it is worthy of being used as a grocery bag.”

The sarcasm in the review is evident: the user isn’t happy with the quality of the bag. However, because the sentence contains words like ‘best’, ‘good’ and ‘worthy’, the review can easily be mistaken for a positive one. Such humorous, if cryptic, reviews often go viral on social media. If they are not detected and acted upon, they can damage a company’s reputation, especially if the company is planning a new launch. Detecting sarcasm in reviews is an important use case of Natural Language Processing, and we shall see how Driverless AI can help us in this regard.

Sentiment Analysis: eliciting vital insights from unstructured data

Source: 5 ways sentiment analysis can boost your business

Before we get into the nitty-gritty of sarcasm detection, let’s try and have a holistic overview of Sentiment Analysis.

Sentiment analysis, also known as opinion mining, is a sub-field of Natural Language Processing (NLP) that tries to identify and extract opinions from text. Earlier, companies relied on traditional methods like surveys and focus group studies to gather consumer feedback. However, technologies backed by Machine Learning and Artificial Intelligence have made it possible to analyse text from a wide variety of sources with more accuracy and less effort. The ability to extract emotions from text is a valuable tool that has the potential to improve the ROI of many businesses.

Importance of Sentiment Analysis

Advantages of Sentiment Analysis in driving Business

Paul Hoffman, the CTO of Space-Time Insight, once said, “If you want to understand people, especially your customers…then you have to be able to possess a strong capability to analyse text”. We couldn’t agree more with Paul, since the power that text analysis brings to businesses has been quite evident in recent years. With a surge in social media activity, emotions are seen as valuable commodities from a business perspective. By carefully gauging people’s opinions and sentiments, companies can reasonably figure out what people think about a product and incorporate that feedback accordingly.

Sarcasm: Negative sentiment using Positive words

Sentiment analysis is not an easy task to perform. Text data often comes pre-loaded with a lot of noise. Sarcasm is one such type of noise innately present in social media and product reviews which may interfere with the results.

Sarcastic texts demonstrate a unique behavior. Unlike a simple negation, a sarcastic sentence conveys a negative sentiment using only words with positive connotations. Here are a few examples where sarcasm is pretty evident.

Sentiment analysis can easily be misled by such sarcastic text, and hence sarcasm detection is a vital preprocessing step in many NLP tasks. It is useful to identify and remove these noisy samples before training models for NLP applications.
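To make this concrete, here is a minimal sketch of a word-counting sentiment scorer applied to the laptop bag review from the introduction. The word lists and the `lexicon_score` function are made up for illustration and do not come from any real sentiment library; the point is only to show how positive words can mask a negative opinion.

```python
import re

# Toy sentiment lexicon: these word lists are illustrative assumptions,
# not taken from any real sentiment library.
POSITIVE = {"best", "good", "worthy", "great", "love"}
NEGATIVE = {"bad", "worst", "poor", "broken", "hate"}


def lexicon_score(text: str) -> int:
    """Return (number of positive words) - (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


review = (
    "This is the best laptop bag ever. It is so good that within two months "
    "of use, it is worthy of being used as a grocery bag."
)

# Every sentiment-bearing word in the review is positive, so the score comes
# out positive even though the reviewer is clearly unhappy.
print(lexicon_score(review))
```

A scorer like this sees ‘best’, ‘good’ and ‘worthy’ and nothing else, which is exactly the failure mode that a dedicated sarcasm classifier is meant to catch.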

Sarcasm detection using Driverless AI (DAI)

Driverless AI comes equipped with Natural Language Processing (NLP) recipes for text classification and regression problems. The platform supports both standalone text and text with other numerical values as predictive features. The following recipes and models have been implemented in DAI:

Driverless AI automatically converts text strings into features using powerful techniques like TF-IDF, CNN, and GRU. With TensorFlow, Driverless AI can also process larger text blocks and build models using all available data to solve business problems. Driverless AI has state-of-the-art NLP capabilities for sentiment analysis, and we shall utilise them to build a sarcasm detection classifier.
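For readers who have not met TF-IDF before, the sketch below shows the general idea of turning text into numeric features. It is a simplified, standard-library illustration using one common smoothed formula, not the implementation used inside Driverless AI; the `tfidf` function and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * smoothed inverse document frequency, where
    idf = ln((1 + N) / (1 + df)) + 1. Other TF-IDF variants exist.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


docs = ["great bag", "great price", "bag broke"]
for row in tfidf(docs):
    print(row)
```

Terms that appear in fewer documents (such as "price" here) receive a higher weight than terms shared across documents (such as "great"), which is what lets a downstream model focus on the more distinctive words.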

The dataset consists of 1.3 million sarcastic comments from the Internet commentary website Reddit, labelled as sarcastic and non-sarcastic. The source of the dataset is a paper titled “A Large Self-Annotated Corpus for Sarcasm”. A processed version of the dataset can also be found on Kaggle. Let’s explore the dataset before running the various classification algorithms.

Importing the data

The dataset consists of about a million rows, and each record consists of ten attributes:

We are mainly interested in the following two columns:

  • label: 1 for sarcastic comment and 0 for non-sarcastic comment
  • comment: The text column which will be used for running the experiment
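As a rough sketch of this step, the snippet below reads a CSV and keeps only the `label` and `comment` columns. The sample rows and the extra `author` and `subreddit` columns are made-up stand-ins so that the example runs on its own; `load_label_and_comment` is a hypothetical helper, not part of Driverless AI.

```python
import csv
import io

# Stand-in for the real CSV file: a few made-up rows with the two columns of
# interest plus two extra (hypothetical) columns that get dropped.
SAMPLE_CSV = """label,comment,author,subreddit
0,example comment one,user_a,sub_x
1,example comment two,user_b,sub_y
1,example comment three,user_c,sub_x
"""


def load_label_and_comment(handle):
    """Read CSV rows and keep only the integer label and the comment text."""
    reader = csv.DictReader(handle)
    return [(int(row["label"]), row["comment"]) for row in reader]


rows = load_label_and_comment(io.StringIO(SAMPLE_CSV))
print(rows)
```

For the actual dataset, the same function could be given a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the downloaded CSV was saved.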

Exploratory data analysis

The dataset is perfectly balanced, with an equal number of sarcastic and non-sarcastic comments.

The distribution of lengths for sarcastic and normal comments is also almost the same.

Distribution of Sarcastic vs Non-Sarcastic Comments
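The two checks described in this section, class balance and comment length per class, can be sketched in a few lines. The `(label, comment)` pairs below are invented placeholders rather than rows from the Reddit data, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Made-up (label, comment) pairs standing in for the real data.
rows = [
    (0, "short one"),
    (1, "another short one"),
    (0, "a slightly longer comment here"),
    (1, "tiny"),
]


def class_counts(rows):
    """Count how many comments carry each label."""
    return Counter(label for label, _ in rows)


def mean_length_by_label(rows):
    """Average comment length (in characters) for each label."""
    lengths = {}
    for label, comment in rows:
        lengths.setdefault(label, []).append(len(comment))
    return {label: mean(values) for label, values in lengths.items()}


print(class_counts(rows))
print(mean_length_by_label(rows))
```

On the real dataset, equal counts per label would confirm the balance, and similar mean lengths would suggest that comment length alone is unlikely to separate the two classes.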

Since the dataset has been converted into a tabular format, it is ready to be fed into Driverless AI. Note that text features will be automatically generated and evaluated during the feature engineering process.

Launching the Experiment

We shall launch our experiment in three parts to get the best possible results.

  • With built-in TF-IDF NLP recipes

In the first part, we shall use the built-in TF-IDF capabilities of DAI.

In case you want to refresh your knowledge about getting started with Driverless AI, feel free to take a Test Drive. Test Drive is H2O’s Driverless AI on the AWS Cloud, where you can explore all its features without having to download it.

Start a fresh instance of DAI. Next, split the dataset into training and testing sets in a 70:30 ratio and specify label as the target column. We shall also deselect all the other columns and retain only the comment column in our dataset. Finally, select LogLoss as the scorer, keep all the other parameters at their defaults, and launch the experiment. The screen should appear as follows:
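In DAI the split is done through the interface, but the idea behind a 70:30 split can be sketched in plain Python as follows. The `split_rows` function, its seed, and the toy input are assumptions made for illustration; this is not how Driverless AI performs the split internally.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and held-out parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # placeholder for the real records
train, held_out = split_rows(rows)
print(len(train), len(held_out))
```

Shuffling before cutting matters when the file is ordered in some way, since an unshuffled cut could leave the two parts with different class mixes.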

Sentiment Analysis with built-in NLP recipes

  • With built-in TensorFlow NLP recipes

As an alternative, we will launch another instance of the same experiment, but with TensorFlow models enabled, since TextCNN relies on TensorFlow. Click on the ‘Expert Settings’ tab and switch on ‘TensorFlow Models’. The rest of the process remains the same.

Sentiment Analysis with built-in TensorFlow recipes

  • With Custom Sentiment Recipes

If the built-in recipes aren’t sufficient, it may be worth building our own recipe that is focused on our specific use case. The latest version (1.7.0) of DAI implements a key feature called BYOR, which stands for Bring Your Own Recipes. This feature has been designed to enable Data Scientists to customise DAI as per their business needs. You can read more about this feature here.

To upload a custom recipe, go to the expert settings and upload the desired recipe. H2O has built and open-sourced more than 80 recipes which can be used as templates. These recipes can be accessed from https://github.com/h2oai/driverlessai-recipes. For this experiment, we shall use the following recipe:

TextBlob is a Python library that offers a simple API to access its methods and perform basic NLP tasks. It can perform many NLP tasks such as sentiment analysis, spell check, summary creation, translation etc. Click on the ‘Expert Settings’ tab, navigate to driverlessai-recipes > transformers > nlp, and select the desired recipe. Click save to save the settings.
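To give a feel for what such a sentiment recipe does, here is a simplified, standalone sketch: a class that turns each text into one numeric sentiment feature. `SentimentFeature` and `toy_polarity` are hypothetical names invented for this example, and the default scorer is a crude made-up lexicon so that the snippet runs without extra packages. Real recipes in the open-source repository build on a base class supplied by the platform, which is omitted here.

```python
from typing import Callable, List


def toy_polarity(text: str) -> float:
    """Crude, made-up polarity in [-1.0, 1.0]; a stand-in for a real scorer."""
    words = text.lower().split()
    pos = sum(word in {"good", "best", "great"} for word in words)
    neg = sum(word in {"bad", "worst", "awful"} for word in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


class SentimentFeature:
    """Hypothetical stand-in for a custom transformer: text in, number out."""

    def __init__(self, scorer: Callable[[str], float] = toy_polarity):
        self.scorer = scorer

    def fit_transform(self, texts: List[str]) -> List[float]:
        # Nothing is learned from the data, so fitting is just transforming.
        return self.transform(texts)

    def transform(self, texts: List[str]) -> List[float]:
        return [self.scorer(str(text)) for text in texts]


feature = SentimentFeature()
print(feature.fit_transform(["best bag ever", "awful zipper", "a bag"]))
```

With TextBlob installed, the toy scorer could be swapped for TextBlob’s own polarity score, for example `SentimentFeature(lambda t: TextBlob(t).sentiment.polarity)`, which also returns values between -1.0 and 1.0.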

Next, you can also select specific transformers and deselect the rest.

Experiment Results Summary

The screenshot below shows the comparison between the three instances of DAI with different recipes. The inclusion of a custom recipe reduced the LogLoss from 0.54 to 0.50 (lower is better), an improvement that can translate into real value in a business setting.
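To put these LogLoss figures in context, the sketch below computes binary log loss from its standard definition and compares the reported scores with a trivial baseline. The `log_loss` helper and the toy labels are written for illustration only and are not taken from the experiment.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# A model that always answers 0.5 scores ln(2), roughly 0.693, on any labels.
baseline = log_loss([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
print(round(baseline, 3))

# exp(-LogLoss) is the geometric mean of the probability that the model
# assigned to the true class.
for score in (0.54, 0.50):
    print(score, round(math.exp(-score), 3))
```

Read this way, both experiments beat the always-0.5 baseline of about 0.693, and moving from 0.54 to 0.50 corresponds to the model placing somewhat more probability on the correct class on average (roughly 0.58 versus 0.61 as a geometric mean).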

Once the experiment is done, users can make new predictions and download the scoring pipeline, just like any other Driverless AI experiment.

Conclusion

Sentiment analysis can play a crucial role in the marketing domain. It can help to create targeted brand messages and assist a company in understanding consumers’ preferences. These insights could be critical for a company to increase its reach and influence across a range of sectors.

About the Author

Parul Pandey
