September 12th, 2018

Automatic Feature Engineering for Text Analytics – The Latest Addition to Our Kaggle Grandmasters’ Recipes

RSS icon RSS Category: Data Science, Driverless AI, GPU, NLP

According to Kaggle’s ‘The State of Machine Learning and Data Science’ survey, text data is the second most used data type at work for data scientists. There are a lot of interesting text analytics applications like sentiment prediction, product categorization, document classification and so on.

In the latest version (1.3) of our Driverless AI platform, we have included Natural Language Processing (NLP) recipes for text classification and regression problems. Our platform has the ability to support both standalone text and text with other numerical values as predictive features. In particular, we have implemented the following recipes and models:

– **Text-specific feature engineering recipes**:
– TFIDF, Frequency of n-grams
– Truncated SVD
– Word embeddings

– **Text-specific models to extract features from text**:
– Convolutional neural network models on word embeddings
– Linear models on TFIDF vectors

A Typical Example: Sentiment Analysis

Let us illustrate the usage with a classical example of sentiment analysis on tweets using the US Airline Sentiment dataset from Figure Eight’s Data for Everyone library. We can split the dataset into training and test with this simple script. We will just use the tweets in the ‘text’ column and the sentiment (positive, negative or neutural) in the ‘airline_sentiment’ column for this demo. Here are some samples from the dataset:

Once we have our dataset ready in the tabular format, we are all set to use the Driverless AI. Similar to other problems in the Driverless AI setup, we need to choose the dataset and then specify the target column (‘airline_sentiment’).

Since there are other columns in the dataset, we need to click on ‘Dropped Cols’ and then exclude everything but ‘text’ as shown below:

Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to ‘Expert Settings’ and switch on ‘TensorFlow Models’.

At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.

Once the experiment is done, users can make new predictions and download the scoring pipeline just like any other Driverless AI experiments.

Bonus fact #1: The masterminds behind our NLP recipes are Sudalai Rajkumar (aka SRK) and Dmitry Larko.

Bonus fact #2: Don’t want to use the Driverless AI GUI? You can run the same demo using our Python API. See this example notebook.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
SRK and Joe

About the Authors

Jo-Fai Chow

Jo-fai (or Joe) has multiple roles (data scientist / evangelist / community manager) at H2O.ai. Since joining the company in 2016, Joe has delivered H2O talks/workshops in 40+ cities around Europe, US, and Asia. Nowadays, he is best known as the H2O #360Selfie guy. He is also the co-organiser of H2O's EMEA meetup groups including London Artificial Intelligence & Deep Learning - one of the biggest data science communities in the world with more than 11,000 members.

Sudalai Rajkumar

Sudalai Rajkumar (aka SRK) is a Data Scientist at H2O.ai Inc, building Driverless AI, an automated machine learning platform. Prior to this, he was with Freshworks, Tiger Analytics and Global Analytics. He has solved a lot of interesting data science problems for various customers across the globe in multiple domains including finance, e-commerce, online advertising, health care, transportation, retail. He has worked on varied problems ranging from doing simple analysis on structured data to natural language processing and voice analytics in his career. Apart from his day job, he takes part in various data science competitions to enhance his knowledge and has won several of them. He is a Kaggle Grandmaster in the Competitions & Kernels section.

Leave a Reply

What are we buying today?

Note: this is a guest blog post by Shrinidhi Narasimhan. It’s 2021 and recommendation engines are

July 5, 2021 - by Rohan Rao
The Emergence of Automated Machine Learning in Industry

This post was originally published by K-Tech, Centre of Excellence for Data Science and AI,

June 30, 2021 - by Parul Pandey
What does it take to win a Kaggle competition? Let’s hear it from the winner himself.

In this series of interviews, I present the stories of established Data Scientists and Kaggle

June 14, 2021 - by Parul Pandey
Snowflake on H2O.ai
H2O Integrates with Snowflake Snowpark/Java UDFs: How to better leverage the Snowflake Data Marketplace and deploy In-Database

One of the goals of machine learning is to find unknown predictive features, even hidden

June 9, 2021 - by Eric Gudgion
Getting the best out of H2O.ai’s academic program

“H2O.ai provides impressively scalable implementations of many of the important machine learning tools in a

May 19, 2021 - by Ana Visneski and Jo-Fai Chow
Regístrese para su prueba gratuita y podrá explorar H2O AI Hybrid Cloud

Recientemente, lanzamos nuestra prueba gratuita de 14 días de H2O AI Hybrid Cloud, lo que

May 17, 2021 - by Ana Visneski and Jo-Fai Chow

Start your 14-day free trial today