October 8th, 2021
Improving NLP Model Performance with Context-Aware Feature Extraction
Category: H2O AI Hybrid Cloud, NLP, Technical Posts
By: Jo-Fai Chow
I would like to share with you a simple yet very effective trick to improve feature engineering for text analytics. After reading this article, you will be able to follow the exact steps and try it yourself using our H2O AI Hybrid Cloud.
First of all, let’s have a look at the off-the-shelf natural language processing (NLP) recipes in H2O Driverless AI (one of our AI Cloud’s AutoML products). We have some standard text transformation recipes like Term Frequency-Inverse Document Frequency (TF-IDF) as well as some complex ones like Convolutional Neural Network (CNN), Bi-directional Gated Recurrent Unit (BiGRU), and Bidirectional Encoder Representations from Transformers (BERT). You can find the full list of available text transformers here.
Off-the-shelf NLP recipes in H2O Driverless AI
So, in other words, we already have many general-purpose NLP recipes to cover the most common text analytics use cases. But we don’t stop right there. We know that it is possible to further improve predictive performance with smart and, more importantly, domain-specific feature extraction. That’s why we make the NLP capabilities in Driverless AI extensible via custom recipes. We can leverage state-of-the-art NLP models from the research community and perform context-aware feature extraction with minimal effort in Driverless AI.
Let me show you how.
A Quick Tutorial – Airline Twitter Sentiment
The Airline Twitter Sentiment dataset was scraped in 2015, and contributors were asked to classify each tweet as positive, negative, or neutral. You can find out more about the dataset and download it from here. Out of the 20 columns available in the dataset, we are only interested in two: text (the single feature) and airline_sentiment (the target).
Airline Twitter Sentiment Dataset
Step 1 – Split the Data
Follow these steps to import the airline dataset into Driverless AI. Since the Airline Twitter Sentiment dataset is just a single CSV without a dedicated test dataset, we can split the dataset into airline_train and airline_test using the dataset splitter as shown below.
Dataset splitter interface in Driverless AI
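For readers who want to see what such a split does conceptually, here is a minimal sketch of a stratified train/test split in plain Python. It is not the Driverless AI splitter itself; the helper name `stratified_split`, the 80/20 ratio, the seed, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, target_key, test_fraction=0.2, seed=42):
    """Split rows into train/test while keeping class proportions similar.

    Hypothetical helper for illustration; ratio and seed are assumptions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)

    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        # Take the same fraction of every class for the test set.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy rows that mimic the two columns used in this tutorial.
rows = (
    [{"text": f"bad flight {i}", "airline_sentiment": "negative"} for i in range(10)]
    + [{"text": f"ok flight {i}", "airline_sentiment": "neutral"} for i in range(5)]
    + [{"text": f"great flight {i}", "airline_sentiment": "positive"} for i in range(5)]
)
airline_train, airline_test = stratified_split(rows, "airline_sentiment")
print(len(airline_train), len(airline_test))
```

Splitting per class keeps the share of positive, negative, and neutral tweets roughly the same in both parts, which matters when one sentiment dominates the data.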
Step 2 – Build a Baseline Model
Now we are ready to train our first model using airline_train and then evaluate its out-of-sample performance on airline_test. For the first baseline model, we are going to leave most settings as default. Since we are only using the text column as a single feature for this exercise, we need to remove the rest (see dropped columns settings below) before we launch the experiment.
Driverless AI model training settings for the baseline model
Remember to drop everything but text in the dropped columns setting
As we haven’t switched on complex text transformation (e.g. CNN, BiGRU, BERT), the transformed features from this simple experiment are all TF-IDF-based. We can certainly improve this baseline model with more complex transformation so let’s move on to the next step.
The most important features for the baseline model are TF-IDF-based features
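To make the TF-IDF idea concrete, the sketch below computes weights by hand using the textbook form tf × log(N / df). This is an illustration only: the function name `tfidf` and the toy tweets are assumptions, and Driverless AI (like most libraries) may apply smoothing, normalization, or n-grams that this version leaves out.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy tweets, invented for this example.
tweets = [
    "flight delayed again",
    "great flight great crew",
    "lost my bag again",
]
for tweet, scores in zip(tweets, tfidf(tweets)):
    print(tweet, {t: round(w, 3) for t, w in scores.items()})
```

Words that appear in many tweets (such as "flight") receive a lower weight than words that are specific to a single tweet (such as "delayed"). The weights carry no notion of word order or context, which is the limitation the later steps address.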
Step 3 – Improve the Baseline with CNN and BiGRU Feature Transformation
In order to switch on more complex text transformation, we need to change two values in expert settings as shown below. This will activate word-based CNN and BiGRU text transformation in the automatic feature engineering pipeline. As a result, we can see that the dominant features in the experiment are created based on CNN and BiGRU (instead of TF-IDF-based features in the baseline model). We can also see an improvement in model performance (i.e. lower logloss and error rate). Can we further improve this? Read on.
Enable word-based CNN and BiGRU models in NLP expert settings
New Features from CNN and BiGRU lead to better predictive performance
Enter the Hugging Face Model Hub
Before we get to the next step, let me introduce a fantastic platform called Hugging Face. Here is the statement on their website:
“We are helping the community work together towards the goal of advancing Artificial Intelligence 🔥. Not one company, even the Tech Titans, will be able to “solve AI” by themselves – the only way we’ll achieve this is by sharing knowledge and resources. On the Hugging Face Hub we are building the largest collection of models, datasets and metrics in order to democratize and advance AI for everyone 🚀. The Hugging Face Hub works as a central place where anyone can share and explore models and datasets.” (Source)
For our Airline Twitter Sentiment exercise, we are going to find a relevant transformer on Hugging Face so that we can perform better feature extraction than those from the general-purpose text transformers in Driverless AI.
Find out more on Hugging Face’s website
Step 4 – Find a Domain-Specific Transformer
A quick keyword search on Hugging Face turns up the twitter-roberta-base-sentiment model from the Cardiff NLP group. The model was trained on many different tweets. That sounds relevant to our use case here so let’s give it a try!
Searching for domain-specific models on Hugging Face
Example outputs of the twitter-roberta-base-sentiment model that can be used as new features
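If you would like to inspect such outputs outside Driverless AI, the following sketch shows one way to score a tweet with the Hugging Face `transformers` library. It is an assumption-laden illustration, not how the Driverless AI recipe works internally: the helper names `softmax` and `sentiment_features` are hypothetical, and the label order follows the model card (0 = negative, 1 = neutral, 2 = positive).

```python
import math

MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"
# Label order as documented on the model card.
LABELS = ["negative", "neutral", "positive"]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_features(text):
    """Return {label: probability} for one tweet (hypothetical helper).

    Needs the third-party `transformers` and `torch` packages and downloads
    the model on first use, so the imports are kept local to this function.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()
    encoded = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**encoded).logits[0].tolist()
    return dict(zip(LABELS, softmax(logits)))


# Offline demonstration with made-up logits (no model download needed).
example_probs = dict(zip(LABELS, softmax([-1.0, 0.5, 2.0])))
print(example_probs)
```

Calling `sentiment_features("...")` on a tweet returns three probabilities, and it is numeric columns of this kind that a downstream model can consume as features.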
Step 5 – Extract Context-Aware Features with the Twitter-Roberta-based Transformer
Now, this is the most important step. If you get this right, you will be able to import many more transformers from Hugging Face.
First, we need to write a simple Python script that imports the twitter-roberta-base-sentiment transformer into Driverless AI. Let’s call this script TwitterRobertaTransformer.py. The most important parameters in this script are MODEL_NAME and the name of the class. Replace them to point at other models on Hugging Face and you will be able to import many other transformers into Driverless AI.
    from h2oaicore.systemutils import config
    from h2oaicore.transformer_utils import CustomTransformer
    from h2oaicore.transformers_nlp import BERTTransformer

    # Hugging Face model identifier; change this to import a different model.
    MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment'


    class TwitterRoberta(BERTTransformer, CustomTransformer):
        _mojo = False

        @staticmethod
        def get_default_properties():
            # Operate on exactly one text column.
            return dict(col_type="text", min_cols=1, max_cols=1, relative_importance=1)

        @staticmethod
        def get_parameter_choices():
            # Reuse the batch size and padding length from the Driverless AI config.
            return dict(model_type=[MODEL_NAME],
                        batch_size=[config.pytorch_nlp_fine_tuning_batch_size],
                        seq_length=[config.pytorch_nlp_fine_tuning_padding_length])
Once we have the script ready, we can go to the recipes tab in expert settings and upload the script as shown below. You will also need to enable it by selecting TwitterRoberta in the specific transformers setting. After that, you should be able to see TwitterRoberta in the feature engineering search space.
Adding new feature transformation via custom recipe
Twitter-Roberta-based transformation is now available for the feature engineering pipeline
As expected, we can get better predictive performance with domain-specific features from the twitter-roberta-base-sentiment transformer.
Twitter-Roberta-based features further improve the predictive performance
In short, we started with a simple baseline model using standard text transformations like TF-IDF and then improved the performance with CNN/BiGRU feature transformations. To perform context-aware and domain-specific feature extraction, we imported the twitter-roberta-base-sentiment transformer and improved the model performance further.
Comparing model performance based on various text transformations
(score = logloss, lower = better)
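For readers less familiar with the metric, here is a minimal sketch of multiclass logloss: the average negative log of the probability the model assigns to the true class. The function name `multiclass_logloss` and the toy predictions are assumptions for illustration; they are not results from the experiments above.

```python
import math


def multiclass_logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model is overconfident and wrong.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Toy labels and predictions, invented for this example.
labels = ["negative", "positive", "neutral"]
confident = [
    {"negative": 0.8, "neutral": 0.1, "positive": 0.1},
    {"negative": 0.1, "neutral": 0.1, "positive": 0.8},
    {"negative": 0.1, "neutral": 0.8, "positive": 0.1},
]
hesitant = [
    {"negative": 0.4, "neutral": 0.3, "positive": 0.3},
    {"negative": 0.3, "neutral": 0.3, "positive": 0.4},
    {"negative": 0.3, "neutral": 0.4, "positive": 0.3},
]
print(round(multiclass_logloss(labels, confident), 4))
print(round(multiclass_logloss(labels, hesitant), 4))
```

Both toy models pick the correct class every time, yet the confident one scores a lower logloss. That is why logloss can keep improving across experiments even when the error rate barely moves.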
Your Turn to Try!
It is possible to improve the model even further (see screenshot below). I am not going to reveal the exact procedure but I am sure you can figure it out fairly quickly. Here are a few hints:
- Can we switch on other BERT transformers that come with Driverless AI?
- What if we try different accuracy/time/interpretability settings? This leaderboard feature may help.
- Can we mix and match other transformers from Hugging Face?
Mix and match different text transformers. Yes, you can do better than this!
With custom recipes, it is possible to extend and improve text transformation in Driverless AI using state-of-the-art models from the AI community. Thus, we already have the technology in place to future-proof our automatic feature engineering pipeline. We are excited to see what our users can do with different transformers. For example, could you extract predictive features with BioBert for health care use cases? Could you get a competitive edge in the stock market with features from FinBert? The possibilities are endless. We hope that our technology will enable our users to benefit from the latest transformers with minimal effort for many years to come.
How to Get Started?
H2O AI Hybrid Cloud is the best way to get free, hands-on experience. No installation. All you need is a web browser. Start your free trial today.