October 8th, 2021

Improving NLP Model Performance with Context-Aware Feature Extraction


I would like to share with you a simple yet very effective trick to improve feature engineering for text analytics. After reading this article, you will be able to follow the exact steps and try it yourself using our H2O AI Hybrid Cloud.

First of all, let’s have a look at the off-the-shelf natural language processing (NLP) recipes in H2O Driverless AI (one of our AI Cloud’s AutoML products). We have some standard text transformation recipes like Term Frequency-Inverse Document Frequency (TF-IDF) as well as some complex ones like Convolutional Neural Network (CNN), Bi-directional Gated Recurrent Unit (BiGRU), and Bidirectional Encoder Representations from Transformers (BERT). You can find the full list of available text transformers here.

Off-the-shelf NLP recipes in H2O Driverless AI

So, in other words, we already have many general-purpose NLP recipes to cover the most common text analytics use cases. But we don’t stop right there. We know that it is possible to further improve predictive performance with smart and, more importantly, domain-specific feature extraction. That’s why we make the NLP capabilities in Driverless AI extensible via custom recipes. We can leverage state-of-the-art NLP models from the research community and perform context-aware feature extraction with minimal effort in Driverless AI.

Let me show you how.

A Quick Tutorial – Airline Twitter Sentiment

The Airline Twitter Sentiment dataset was scraped in 2015, and contributors were asked to classify each tweet as positive, negative, or neutral. You can find out more about the dataset and download it from here. Out of the 20 columns available in the dataset, we are only interested in text (the single feature) and airline_sentiment (the target).

Airline Twitter Sentiment Dataset
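
If you want to inspect the two columns outside Driverless AI first, here is a minimal sketch using only the Python standard library. The sample rows and the airline name are made up for illustration; only the column names text and airline_sentiment come from the dataset description above.

```python
import csv
import io

# Tiny made-up sample standing in for the real CSV. Only the column names
# `text` and `airline_sentiment` are taken from the article.
SAMPLE_CSV = """tweet_id,airline_sentiment,airline,text
1,positive,ExampleAir,"Thanks for the smooth flight!"
2,negative,ExampleAir,"Two hour delay, no updates."
3,neutral,ExampleAir,"What time does check-in open?"
"""


def select_text_and_target(csv_file):
    """Return a list of (text, airline_sentiment) pairs, ignoring other columns."""
    reader = csv.DictReader(csv_file)
    return [(row["text"], row["airline_sentiment"]) for row in reader]


rows = select_text_and_target(io.StringIO(SAMPLE_CSV))
for text, label in rows:
    print(f"{label:>8}: {text}")
```

To run this on the real download, pass an open file handle instead of the in-memory sample.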

Step 1 – Split the Data

Follow these steps to import the airline dataset into Driverless AI. Since the Airline Twitter Sentiment dataset is just a single CSV without a dedicated test dataset, we can split the dataset into airline_train and airline_test using the dataset splitter as shown below.

Dataset splitter interface in Driverless AI
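
For readers who like to see the idea in code, the sketch below shows one way to split a labelled dataset while keeping the class proportions similar in both parts. It is not the Driverless AI splitter itself: the 80/20 ratio, the stratification by target, the function name stratified_split, and the sample rows are all my assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.8, seed=42):
    """Split rows into (train, test), keeping class proportions roughly equal."""
    rng = random.Random(seed)

    # Group rows by their target label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Made-up rows in the form (text, sentiment).
rows = (
    [(f"neg tweet {i}", "negative") for i in range(60)]
    + [(f"neu tweet {i}", "neutral") for i in range(20)]
    + [(f"pos tweet {i}", "positive") for i in range(20)]
)

airline_train, airline_test = stratified_split(rows, label_of=lambda r: r[1])
print(len(airline_train), len(airline_test))
```

Fixing the seed makes the split reproducible, which matters when you compare several experiments against the same test set.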

Step 2 – Build a Baseline Model

Now we are ready to train our first model using airline_train and then evaluate the out-of-sample performance with airline_test. For the first baseline model, we are going to leave most settings as default. Since we are only using the text column as a single feature for this exercise, we need to drop the other columns (see the dropped columns setting below) before we launch the experiment.

Driverless AI model training settings for the baseline model

Remember to drop everything but text in the dropped columns setting

As we haven’t switched on complex text transformation (e.g. CNN, BiGRU, BERT), the transformed features from this simple experiment are all TF-IDF-based. We can certainly improve this baseline model with more complex transformation so let’s move on to the next step.

The most important features for the baseline model are TF-IDF-based features
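
To make the baseline features less abstract, here is a minimal TF-IDF sketch in plain Python. It uses one common smoothed variant of the formula and simple whitespace tokenization, both of which are my assumptions; the TF-IDF recipe inside Driverless AI may differ in tokenization, weighting, and normalization. The example tweets are made up.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using a smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency times smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


tweets = [
    "flight delayed again",
    "great flight great crew",
    "flight cancelled",
]
features = tfidf(tweets)
for tweet, weights in zip(tweets, features):
    print(tweet, "->", {term: round(w, 3) for term, w in weights.items()})
```

Notice that "flight" appears in every tweet and therefore gets a lower weight than rarer words such as "delayed". That is the whole point of the inverse document frequency term, and it is also why TF-IDF cannot capture word order or context.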

Step 3 – Improve the Baseline with CNN and BiGRU Feature Transformation

In order to switch on more complex text transformation, we need to change two values in expert settings as shown below. This will activate word-based CNN and BiGRU text transformation in the automatic feature engineering pipeline. As a result, we can see that the dominant features in the experiment are created based on CNN and BiGRU (instead of TF-IDF-based features in the baseline model). We can also see an improvement in model performance (i.e. lower logloss and error rate). Can we further improve this? Read on.

Enable word-based CNN and BiGRU models in NLP expert settings

New Features from CNN and BiGRU lead to better predictive performance
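
In case the two metrics are unfamiliar, the sketch below computes multiclass logloss and error rate by hand. The predictions are made up for illustration and are not results from the experiments in this article; I am also assuming that error rate means the share of rows whose most probable class is wrong.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


def error_rate(y_true, y_prob):
    """Share of rows where the most probable class is not the true class."""
    wrong = sum(
        1 for label, probs in zip(y_true, y_prob)
        if max(probs, key=probs.get) != label
    )
    return wrong / len(y_true)


# Made-up predictions for three tweets.
y_true = ["negative", "positive", "neutral"]
y_prob = [
    {"negative": 0.7, "neutral": 0.2, "positive": 0.1},
    {"negative": 0.1, "neutral": 0.3, "positive": 0.6},
    {"negative": 0.5, "neutral": 0.4, "positive": 0.1},
]

print(f"logloss    = {log_loss(y_true, y_prob):.4f}")
print(f"error rate = {error_rate(y_true, y_prob):.4f}")
```

Logloss rewards well-calibrated probabilities, not just correct labels, so it can improve even when the error rate stays the same.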

Enter the Hugging Face Model Hub

Before we get to the next step, let me introduce a fantastic platform called Hugging Face. Here is the statement on their website:

“We are helping the community work together towards the goal of advancing Artificial Intelligence 🔥. Not one company, even the Tech Titans, will be able to “solve AI” by themselves – the only way we’ll achieve this is by sharing knowledge and resources. On the Hugging Face Hub we are building the largest collection of models, datasets and metrics in order to democratize and advance AI for everyone 🚀. The Hugging Face Hub works as a central place where anyone can share and explore models and datasets.” (Source)

For our Airline Twitter Sentiment exercise, we are going to find a relevant transformer on Hugging Face so that we can extract better features than the general-purpose text transformers in Driverless AI provide.

Find out more on Hugging Face’s website

Step 4 – Find a Domain-Specific Transformer

From a quick search on Hugging Face using the keyword twitter, we can find the twitter-roberta-base-sentiment model from the Cardiff NLP group. According to its model card, it is a RoBERTa-base model trained on roughly 58 million tweets and fine-tuned for sentiment analysis on the TweetEval benchmark. That sounds relevant to our use case here so let’s give it a try!

Searching for domain-specific models on Hugging Face

Example outputs of the twitter-roberta-base-sentiment model that can be used as new features
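
To illustrate how such outputs can become numeric features, here is a standalone sketch. It is not how the Driverless AI recipe works internally. The helper names, the roberta_ feature prefix, and the example logits are hypothetical, and the label order (negative, neutral, positive) is an assumption you should check against the model card. The score_tweet function needs the transformers and torch packages and downloads the model on first use, so it is defined but not called here.

```python
import math

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"
# Assumed order of the model's three outputs; verify against the model card.
LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_features(logits):
    """Name each class probability so it can be used as a numeric feature."""
    return {f"roberta_{label}": p for label, p in zip(LABELS, softmax(logits))}


def score_tweet(text):
    """Run the Hugging Face model on one tweet (needs transformers and torch)."""
    # Imported lazily so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoded).logits[0].tolist()
    return logits_to_features(logits)


# Hypothetical logits, for illustration only (not real model output).
example_features = logits_to_features([2.0, 0.5, -1.0])
print(example_features)
```

Three probabilities per tweet is a very compact representation, yet it already encodes what a model trained on tweets has learned about sentiment.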

Step 5 – Extract Context-Aware Features with the Twitter-Roberta-based Transformer

Now, this is the most important step. If you get this right, you will be able to import many more transformers from Hugging Face.

First, we need to write a simple Python script that imports the twitter-roberta-base-sentiment transformer into Driverless AI. Let’s call this script TwitterRobertaTransformer.py. The two most important things in this script are the MODEL_NAME value and the class name. Change them to point at another model on Hugging Face and you will be able to import many other transformers into Driverless AI.

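from h2oaicore.systemutils import config
from h2oaicore.transformer_utils import CustomTransformer
from h2oaicore.transformers_nlp import BERTTransformer

# Hugging Face model identifier; change this to import a different transformer.
MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment'

# Give each imported model its own class name so it shows up separately.
class TwitterRoberta(BERTTransformer, CustomTransformer):
    _mojo = False

    @staticmethod
    def get_default_properties():
        # Apply this transformer to exactly one text column.
        return dict(col_type="text",
                    min_cols=1,
                    max_cols=1,
                    relative_importance=1)

    @staticmethod
    def get_parameter_choices():
        # Offer only MODEL_NAME, with batch size and padding length
        # taken from the Driverless AI configuration.
        return dict(model_type=[MODEL_NAME],
                    batch_size=[config.pytorch_nlp_fine_tuning_batch_size],
                    seq_length=[config.pytorch_nlp_fine_tuning_padding_length]
                    )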

Once we have the script ready, we can go to the recipes tab in expert settings and upload the script as shown below. You will also need to enable it by selecting TwitterRoberta in the specific transformers setting. After that, you should be able to see TwitterRoberta in the feature engineering search space.

Adding new feature transformation via custom recipe

Twitter-Roberta-based transformation is now available for the feature engineering pipeline

As expected, we can get better predictive performance with domain-specific features from the twitter-roberta-base-sentiment model.

Twitter-Roberta-based features further improve the predictive performance

Quick Recap

In short, we start with a simple baseline model using the standard text transformations like TF-IDF and then improve the performance with CNN/BiGRU feature transformations. In order to perform context-aware and domain-specific feature extraction, we import the twitter-roberta-base-sentiment transformer and further improve the model performance.

Comparing model performance based on various text transformations
(score = logloss, lower = better)

Your Turn to Try!

It is possible to improve the model even further (see screenshot below). I am not going to reveal the exact procedure but I am sure you can figure it out fairly quickly. Here are a few hints:

  • Can we switch on other BERT transformers that come with Driverless AI?
  • What if we try different accuracy/time/interpretability settings? This leaderboard feature may help.
  • Can we mix and match other transformers from Hugging Face?

Mix and match different text transformers. Yes, you can do better than this!

Key Takeaways

With custom recipes, it is possible to extend and improve text transformation in Driverless AI using state-of-the-art models from the AI community. In other words, we already have the technology in place to future-proof our automatic feature engineering pipeline. We are excited to see what our users can do with different transformers. For example, could you extract predictive features with BioBERT for health care use cases? Could you get a competitive edge in the stock market with features from FinBERT? The possibilities are endless. We hope that our technology will enable our users to benefit from the latest transformers with minimal effort for many years to come.

How to Get Started?

H2O AI Hybrid Cloud is the best way to get free, hands-on experience. No installation. All you need is a web browser. Start your free trial today.

Credits

The advanced text analytics feature discussed in this article is brought to you by Sudalai Rajkumar, Maximilian Jeblick, and Trushant Kalyanpur.

About the Author

Jo-Fai Chow

Jo-fai (or Joe) has multiple roles (data scientist / evangelist / community manager) at H2O.ai. Since joining the company in 2016, Joe has delivered H2O talks/workshops in 40+ cities around Europe, US, and Asia. Nowadays, he is best known as the H2O #360Selfie guy. He is also the co-organiser of H2O's EMEA meetup groups including London Artificial Intelligence & Deep Learning - one of the biggest data science communities in the world with more than 11,000 members.
