August 15th, 2018

The different flavors of AutoML

Category: AutoML, Data Science, Driverless AI, H2O

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software (e.g. H2O, scikit-learn, keras). Although these tools have made it easy to train and evaluate machine learning models, producing the highest-quality model for a given dataset still requires a good deal of data science knowledge. Writing the code to perform a hyperparameter search over many different types of algorithms can also be time-consuming and repetitive work.

What is AutoML?

The term “AutoML” (Automatic Machine Learning) refers to automated methods for model selection and/or hyperparameter optimization. AutoML is also a subfield of machine learning that has a rich academic history, an annual workshop at the International Conference on Machine Learning (ICML), and academic research labs devoted to this topic (e.g. University of Freiburg Machine Learning Lab in Germany).

The AutoML field began by developing methods for automating hyperparameter optimization in single models, and now includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering.
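To make the definition above concrete, here is a minimal sketch of combined model selection and hyperparameter optimization using plain random search. Everything in it is an illustrative assumption rather than how any particular AutoML tool works: the toy dataset, the two hand-written algorithms (a k-nearest-neighbours classifier and a one-feature threshold rule) and the search space are all made up for the example.

```python
import random

def make_data(n, rng):
    """Toy 2-D binary classification data: the label is 1 when x + y > 1."""
    rows = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        rows.append(((x, y), int(x + y > 1.0)))
    return rows

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def knn_predict(train, point, k):
    """Majority vote among the k nearest training points (k is odd, so no ties)."""
    nearest = sorted(train, key=lambda row: squared_distance(row[0], point))[:k]
    return int(2 * sum(label for _, label in nearest) > k)

def stump_predict(train, point, feature, threshold):
    """Threshold rule on one feature; it ignores the training data entirely."""
    return int(point[feature] > threshold)

# Hypothetical search space: two algorithm families, each with its own hyperparameters.
ALGORITHMS = {"knn": knn_predict, "stump": stump_predict}

def sample_candidate(rng):
    """Draw one (algorithm name, hyperparameters) pair at random."""
    if rng.random() < 0.5:
        return "knn", {"k": rng.choice([1, 3, 5, 7, 9])}
    return "stump", {"feature": rng.choice([0, 1]), "threshold": rng.uniform(0.2, 0.8)}

def validation_accuracy(name, params, train, valid):
    predict = ALGORITHMS[name]
    hits = sum(predict(train, point, **params) == label for point, label in valid)
    return hits / len(valid)

def random_search(train, valid, n_trials, rng):
    """Return the (score, name, params) triple with the best validation accuracy."""
    best = None
    for _ in range(n_trials):
        name, params = sample_candidate(rng)
        score = validation_accuracy(name, params, train, valid)
        if best is None or score > best[0]:
            best = (score, name, params)
    return best

rng = random.Random(42)
train, valid = make_data(200, rng), make_data(100, rng)
best_score, best_name, best_params = random_search(train, valid, n_trials=20, rng=rng)
print(best_name, best_params, best_score)
```

Real AutoML tools replace the random sampling with smarter strategies and the hold-out split with cross-validation, but the shape of the loop (propose a candidate, score it, keep the best) is the same.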

AutoML Tools

The goal of AutoML software is two-fold:

  1. To enable non-experts to train high quality machine learning models.
  2. To improve the efficiency of finding optimal solutions to machine learning problems.

There are a handful of different AutoML platforms (open source, closed source and SaaS), aiming to solve different types of supervised machine learning problems. AutoML tools can largely be categorized by use-case or, more simply, by the format of the training data:

  • IID tabular data (numeric and/or categorical data)
  • Time-series tabular data (numeric and/or categorical data with a time-dependency)
  • Raw text data (text classification)
  • Raw image data (image classification)

Multiple Algorithms

Although some tools handle multiple domains, the majority of AutoML tools are designed for the most common use-case, which is IID tabular data (a table with rows and columns). In open source, that would include tools such as H2O AutoML, auto-sklearn (along with its predecessor, Auto-WEKA) and TPOT. H2O.ai’s Driverless AI is a platform that’s geared towards IID tabular data, but also supports time-series data and raw text. In the case of Driverless AI, automatic feature generation is also part of the AutoML process (and one of the key differentiators between open source H2O AutoML and Driverless AI).

Since there is no single algorithm that consistently performs the best across all datasets (a consequence of the “No Free Lunch Theorem”), these AutoML tools explore a variety of algorithms such as Gradient Boosting Machines (GBMs), Random Forests and GLMs, and in some cases also consider Deep Neural Networks. Another approach shared by most of these tools is ensembling several models together to get a stronger final model, a technique that wins a majority of Kaggle competitions.
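The intuition behind ensembling can be shown with a small sketch. It uses simple majority voting, which is a deliberately simpler technique than the stacking these tools typically perform, and the three base "models" are hypothetical hand-written rules chosen so that each one makes a different mistake.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by most of the base models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def accuracy(predict, inputs, truth):
    return sum(predict(x) == truth(x) for x in inputs) / len(inputs)

# Ground truth for the toy problem: the label is 1 when x >= 5.
def truth(x):
    return int(x >= 5)

# Hypothetical base models, each wrong on exactly one input in 0..9.
base_models = [
    lambda x: int(x >= 4),             # wrong at x = 4
    lambda x: int(x >= 6),             # wrong at x = 5
    lambda x: int(x >= 5 and x != 8),  # wrong at x = 8
]

inputs = list(range(10))
base_accuracies = [accuracy(model, inputs, truth) for model in base_models]
ensemble_accuracy = accuracy(lambda x: majority_vote(base_models, x), inputs, truth)

print(base_accuracies)    # each base model scores 0.9
print(ensemble_accuracy)  # the ensemble scores 1.0
```

Because the base models err on different inputs, the other two outvote each mistake. An ensemble helps most when its members are both reasonably accurate and diverse in their errors.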

Deep Learning: Neural Architecture Search

On the other end of the spectrum is “neural architecture search” (NAS), an AutoML method most commonly used for image classification problems. When training a Deep Neural Network, there are many hyperparameters to tune (e.g. learning rate, batch size, dropout rate); however, one of the biggest contributors to model performance is the network architecture. Two areas where Deep Neural Networks have improved performance over traditional machine learning methods are image classification and natural language processing (NLP).

There are a handful of open source tools for neural architecture search, including TensorFlow and PyTorch implementations of Efficient Neural Architecture Search (ENAS), and Auto-Keras, which performs an efficient NAS using Bayesian hyperparameter optimization. On the SaaS side, Google Cloud offers Cloud AutoML Vision and Cloud Natural Language, which perform a neural architecture search for image and text classification, respectively. These tools require raw image or text data directly, and are therefore not appropriate for typical numeric or categorical tabular data.

How can AutoML Help You?

If you’re part of the majority of data scientists who work with tabular or “relational” data (tables with numeric and/or categorical columns), then H2O AutoML or Driverless AI are great tools to use.

2017 Kaggle.com Data Science Survey Results: What type of data is used at work?
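For the tabular case, a minimal sketch of what a run with H2O AutoML's Python API might look like is shown below. The wrapper function `run_h2o_automl`, its parameter names and the default of 10 models are assumptions made for illustration; the H2O calls themselves (`h2o.init`, `h2o.import_file`, `asfactor`, `H2OAutoML`, `train`, `leaderboard`, `leader`) are part of the library. The sketch assumes a binary or multiclass classification problem stored in a CSV file.

```python
def run_h2o_automl(csv_path, target, max_models=10, seed=1):
    """Train H2O AutoML on a CSV file; return the leaderboard and the best model.

    `csv_path` and `target` (the name of the response column) are supplied
    by the caller; this wrapper is illustrative, not part of H2O itself.
    """
    # Imported lazily so that defining this function does not require the
    # h2o package (or the Java runtime it depends on) to be installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start, or connect to, a local H2O cluster

    frame = h2o.import_file(csv_path)
    # Assumption: a classification task, so the target is converted to a factor.
    frame[target] = frame[target].asfactor()
    predictors = [col for col in frame.columns if col != target]

    # Limit the search by number of models; a time budget is another option.
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=predictors, y=target, training_frame=frame)

    # The leaderboard ranks every model trained, including stacked ensembles.
    return aml.leaderboard, aml.leader
```

A call such as `leaderboard, best = run_h2o_automl("train.csv", "response")` (both names hypothetical) would then return the ranked models and the top performer, which can be used for prediction like any other H2O model.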

If you have image data, Google Cloud AutoML Vision is an option; however, there are several open source tools (listed above) that will do the same thing at no cost and allow you to keep your data off the cloud. Another option is to skip the neural architecture search altogether, and use a pre-trained image classification model combined with transfer learning, as explained in this article by fast.ai co-founder Rachel Thomas.

In the case of text data, you could use a general purpose tool such as Driverless AI, or a tool specifically designed for text such as Google Cloud Natural Language. If you prefer an open source solution, you can apply H2O’s Word2Vec algorithm (or any other open source text processing tool) to convert the text into a numeric format which can be used by H2O AutoML (direct support for text data is on the H2O AutoML roadmap).
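The idea of converting text into a numeric format that a tabular AutoML tool can consume can be sketched in a few lines. Note that this example uses bag-of-words token counts, a much simpler stand-in for Word2Vec embeddings, and that the helper names and sample documents are made up for illustration.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: index for index, tok in enumerate(tokens)}

def to_count_vector(document, vocabulary):
    """Turn one document into a fixed-length row of token counts."""
    counts = Counter(document.lower().split())
    # Dicts preserve insertion order, so columns follow the sorted vocabulary.
    return [counts.get(tok, 0) for tok in vocabulary]

documents = ["Great product great price", "Terrible product"]
vocabulary = build_vocabulary(documents)
table = [to_count_vector(doc, vocabulary) for doc in documents]

print(list(vocabulary))  # ['great', 'price', 'product', 'terrible']
print(table)             # [[2, 1, 1, 0], [0, 0, 1, 1]]
```

Each document becomes one row of numbers with the same columns, which is exactly the tabular shape that tools designed for IID data expect. Word2Vec produces dense vectors that capture word similarity, but the resulting table is used in the same way.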

Conclusion

There are many tools out there, and each is typically designed with a specific use-case or set of use-cases in mind. There are many factors to consider when choosing one, including data preparation, data privacy, cost, modeling options (types of algorithms inside), deployment options and ease-of-use. We hope that you’ll give H2O AutoML and Driverless AI a try!

About the Author

Erin LeDell

Erin is the Chief Machine Learning Scientist at H2O.ai. Erin has a Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on automatic machine learning, ensemble machine learning and statistical computing. She also holds a B.S. and M.A. in Mathematics. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE Digital in 2016) and Marvin Mobile Security (acquired by Veracode in 2012), and the founder of DataScientific, Inc.
