September 12th, 2018

Automatic Feature Engineering for Text Analytics – The Latest Addition to Our Kaggle Grandmasters’ Recipes

Category: Data Science, Driverless AI, GPU, NLP

According to Kaggle’s ‘The State of Machine Learning and Data Science’ survey, text is the second most used data type at work for data scientists. Interesting text analytics applications include sentiment prediction, product categorization, and document classification.

In the latest version (1.3) of our Driverless AI platform, we have included Natural Language Processing (NLP) recipes for text classification and regression problems. The platform supports both standalone text and text combined with other numerical values as predictive features. In particular, we have implemented the following recipes and models:

– **Text-specific feature engineering recipes**:
  – TFIDF, frequency of n-grams
  – Truncated SVD
  – Word embeddings

– **Text-specific models to extract features from text**:
  – Convolutional neural network models on word embeddings
  – Linear models on TFIDF vectors
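
To give a feel for what the first recipe computes, here is a minimal, standard-library-only sketch of TF-IDF weighting over word n-grams. It is not the Driverless AI implementation: the function names `ngrams` and `tfidf`, the smoothed IDF formula, the L2 normalization, and the three example tweets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, tokenize on word characters, and return all 1..n_max-grams."""
    tokens = re.findall(r"\w+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, n_max=2):
    """Return one {ngram: weight} dict per document.

    Weight = raw term frequency * smoothed inverse document frequency,
    where idf = ln((1 + N) / (1 + df)) + 1 (an assumed, common variant),
    followed by L2 normalization of each document vector.
    """
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


# Made-up example tweets, not taken from the dataset used in this post.
tweets = [
    "the flight was delayed again",
    "great crew and a smooth flight",
    "delayed again and no updates",
]
vectors = tfidf(tweets)
```

In this sketch, an n-gram that appears in fewer documents (such as "the" here) receives a higher weight than one shared across documents (such as "flight"). The resulting sparse vectors are the kind of input that a truncated SVD or a linear model would then consume.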

A Typical Example: Sentiment Analysis

Let us illustrate the usage with a classic example of sentiment analysis on tweets using the US Airline Sentiment dataset from Figure Eight’s Data for Everyone library. We can split the dataset into training and test sets with this simple script. We will just use the tweets in the ‘text’ column and the sentiment (positive, negative or neutral) in the ‘airline_sentiment’ column for this demo. Here are some samples from the dataset:

Once we have our dataset ready in tabular format, we are all set to use Driverless AI. As with other problems in the Driverless AI setup, we need to choose the dataset and then specify the target column (‘airline_sentiment’).
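
As an aside on the data preparation step, a train/test split of tabular rows can be sketched in a few lines of plain Python. This is not the script referenced above: the helper name `train_test_split`, the 80/20 ratio, the fixed seed, and the sample rows are assumptions chosen for illustration. Only the column names ‘text’ and ‘airline_sentiment’ come from the dataset description.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up rows that only mimic the two columns used in this demo.
rows = [
    {"text": "flight delayed for three hours", "airline_sentiment": "negative"},
    {"text": "thanks for the quick rebooking", "airline_sentiment": "positive"},
    {"text": "what time does boarding start", "airline_sentiment": "neutral"},
    {"text": "lost my bag again", "airline_sentiment": "negative"},
    {"text": "friendly crew on my flight today", "airline_sentiment": "positive"},
]

train, test = train_test_split(rows, test_fraction=0.2)
```

Each resulting list can then be written to its own CSV file and uploaded as a separate training or test dataset.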

Since there are other columns in the dataset, we need to click on ‘Dropped Cols’ and then exclude everything but ‘text’ as shown below:

Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to ‘Expert Settings’ and switch on ‘TensorFlow Models’.

At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.

Once the experiment is done, users can make new predictions and download the scoring pipeline, just as with any other Driverless AI experiment.

Bonus fact #1: The masterminds behind our NLP recipes are Sudalai Rajkumar (aka SRK) and Dmitry Larko.

Bonus fact #2: Don’t want to use the Driverless AI GUI? You can run the same demo using our Python API. See this example notebook.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
SRK and Joe

About the Author

Jo-Fai Chow

Jo-fai (Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media where he developed data products to enable quick and smart business decisions. He also worked (part-time) for Domino Data Lab as a data science evangelist promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD researcher at STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specializing in data mining and constrained optimization for the utilities sector in the UK and abroad. He holds an MSc in Environmental Management and a BEng in Civil Engineering. Long before Joe immersed himself in the wonderful world of open-source R and Python, he learned his trade as an avid MATLAB user. When he was a kid, his parents taught him one of the famous old Chinese sayings: when one drinks water, one must not forget where it comes from. So when Twitter asked Joe to be creative, he simply put down @matlabulous as his handle. In 2014, his data visualization side project ‘CrimeMap’ led him to a poster presentation at useR! 2014 where he heard about H2O for the very first time. He has been using H2O for various data science projects ever since.
