August 15th, 2018

The different flavors of AutoML


In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software (e.g. H2O, scikit-learn, Keras). Although these tools have made it easy to train and evaluate machine learning models, a good amount of data science knowledge is still required to create the highest-quality model for a given dataset. Writing the code to perform a hyperparameter search over many different types of algorithms can also be time-consuming and repetitive work.

What is AutoML?

The term “AutoML” (Automatic Machine Learning) refers to automated methods for model selection and/or hyperparameter optimization. AutoML is also a subfield of machine learning that has a rich academic history, an annual workshop at the International Conference on Machine Learning (ICML), and academic research labs devoted to this topic (e.g. University of Freiburg Machine Learning Lab in Germany).

The AutoML field began by developing methods for automating hyperparameter optimization in single models, and now includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering.
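As a minimal sketch of automated hyperparameter optimization for a single model, the example below runs a random search over a gradient boosting classifier with scikit-learn. The dataset, the search ranges and the budget of 10 candidates are assumptions chosen for illustration; they are not taken from any particular AutoML tool.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small built-in binary classification dataset (illustrative choice).
X, y = load_breast_cancer(return_X_y=True)

# Search space: each entry is a distribution to sample from.
# randint(a, b) samples integers in [a, b); uniform(loc, scale) samples
# floats in [loc, loc + scale].
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

# Sample 10 random configurations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated AUC:", round(search.best_score_, 4))
```

Random search is only one of several strategies; grid search and Bayesian optimization plug into the same "sample, train, score, keep the best" loop.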

AutoML Tools

The goal of AutoML software is two-fold:

  1. To enable non-experts to train high quality machine learning models.
  2. To improve the efficiency of finding optimal solutions to machine learning problems.

There are a handful of different AutoML platforms (open source, closed source and SaaS), aiming to solve different types of supervised machine learning problems. AutoML tools can largely be categorized by use-case or, more simply, by the format of the training data:

  • IID tabular data (numeric and/or categorical data)
  • Time-series tabular data (numeric and/or categorical data with a time-dependency)
  • Raw text data (text classification)
  • Raw image data (image classification)

Multiple Algorithms

Although some tools handle multiple domains, the majority of AutoML tools are designed for the most common use-case, which is IID tabular data (a table with rows and columns). In open source, that includes tools such as H2O AutoML, auto-sklearn (along with its predecessor, Auto-WEKA) and TPOT. H2O Driverless AI is a platform that’s geared towards IID tabular data, but it also supports time-series data and raw text. In the case of Driverless AI, automatic feature generation is also part of the AutoML process (and one of the key differentiators between open source H2O AutoML and Driverless AI).

Since no single algorithm consistently performs best across all datasets (a consequence of the “No Free Lunch Theorem”), these AutoML tools explore a variety of algorithms such as Gradient Boosting Machines (GBMs), Random Forests and GLMs, and in some cases also consider Deep Neural Networks. Another approach shared by most of these tools is ensembling several models together to get a stronger final model, a technique that appears in many winning Kaggle solutions.
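The idea of trying several algorithm families and then ensembling them can be sketched by hand with scikit-learn. This is an illustrative assumption about how such a comparison might look, not the internals of H2O AutoML, auto-sklearn or TPOT; the dataset, the model settings and the `leaderboard` dictionary are made up for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# One candidate per algorithm family mentioned in the text.
base_models = {
    "gbm": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "glm": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Stacked ensemble: a logistic regression metalearner is trained on the
# cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=list(base_models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Score every candidate the same way and rank them.
candidates = {**base_models, "stacked_ensemble": stack}
leaderboard = {
    name: cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, auc in sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>16}: {auc:.4f}")
```

Which model ends up on top depends on the dataset, which is the point of searching across algorithms rather than committing to one in advance.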

Deep Learning: Neural Architecture Search

At the other end of the spectrum is an AutoML method called “neural architecture search” (NAS), a technique most commonly used for image classification problems. When training a Deep Neural Network, there are many hyperparameters to tune (e.g. learning rate, batch size, dropout rate), but one of the biggest contributors to model performance is the network architecture. Two areas where Deep Neural Networks have improved performance over traditional machine learning methods are image classification and natural language processing (NLP).
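To make the idea of searching over architectures concrete, here is a toy sketch that randomly samples the depth and layer widths of a small fully connected network and keeps the best one on a validation set. Random search over a tiny space is only a simple baseline; it is not ENAS or any production NAS method, and the search space, the budget of five candidates and the digits dataset are assumptions for illustration.

```python
import random

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 grayscale digit images, flattened to 64 features and scaled to [0, 1].
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

rng = random.Random(0)


def sample_architecture():
    """Sample a tuple of hidden layer widths: 1 to 3 layers of 16, 32 or 64 units."""
    depth = rng.randint(1, 3)
    return tuple(rng.choice([16, 32, 64]) for _ in range(depth))


# Train one network per sampled architecture; all other hyperparameters are fixed.
results = []
for _ in range(5):
    architecture = sample_architecture()
    model = MLPClassifier(hidden_layer_sizes=architecture, max_iter=300, random_state=0)
    model.fit(X_train, y_train)
    results.append((model.score(X_valid, y_valid), architecture))

best_score, best_architecture = max(results)
print("Best architecture:", best_architecture)
print("Validation accuracy:", round(best_score, 4))
```

Real NAS systems search far larger spaces (layer types, connections, convolution sizes) and use smarter search strategies to keep the cost manageable.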

There are a handful of open source tools for neural architecture search, including TensorFlow and PyTorch implementations of Efficient Neural Architecture Search (ENAS), and Auto-Keras, which performs an efficient NAS using Bayesian hyperparameter optimization. On the SaaS side, Google Cloud offers Cloud AutoML Vision and Cloud Natural Language, which perform a neural architecture search for image and text classification, respectively. These tools take raw image or text data as input, so they are not appropriate for typical numeric or categorical tabular data.

How can AutoML Help You?

If you’re part of the majority of data scientists who work with tabular or “relational” data (tables with numeric and/or categorical columns), then H2O AutoML or Driverless AI are great tools to use.

2017 Data Science Survey Results: What type of data is used at work?
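For tabular data, a run of H2O AutoML from Python might look like the sketch below. It assumes the `h2o` package and a Java runtime are installed, that the data lives in a CSV file, and that the task is classification; the function name `run_h2o_automl`, the file path and the column name in the usage line are hypothetical.

```python
def run_h2o_automl(csv_path, target, max_models=10, seed=1):
    """Train H2O AutoML on a CSV file and return (leaderboard, best model)."""
    # Imported inside the function so this sketch can be loaded even on a
    # machine where the h2o package is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    # Start (or connect to) a local H2O cluster.
    h2o.init()

    # Load the data into an H2OFrame.
    train = h2o.import_file(csv_path)

    # Assumption: classification, so the response column must be categorical.
    train[target] = train[target].asfactor()
    features = [col for col in train.columns if col != target]

    # Train up to `max_models` base models; by default AutoML also builds
    # stacked ensembles from them.
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=features, y=target, training_frame=train)

    # The leaderboard ranks all trained models; the leader is the top entry.
    return aml.leaderboard, aml.leader


# Hypothetical usage:
# leaderboard, best_model = run_h2o_automl("train.csv", target="response")
# print(leaderboard.head())
```

For a regression problem, the `asfactor()` conversion would be dropped so the response stays numeric.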

If you have image data, Google Cloud AutoML Vision is an option. However, there are several open source tools (listed above) that will do the same thing at no cost and allow you to keep your data off the cloud. Another option is to skip the neural architecture search altogether and use a pre-trained image classification model combined with transfer learning, as explained in this article by fast.ai co-founder Rachel Thomas.

In the case of text data, you could use a general purpose tool such as Driverless AI, or a tool specifically designed for text such as Google Cloud Natural Language. If you prefer an open source solution, you can apply H2O’s Word2Vec algorithm (or any other open source text processing tool) to convert the text into a numeric format which can be used by H2O AutoML (direct support for text data is on the H2O AutoML roadmap).
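The conversion from text to a numeric format can be illustrated with a small, dependency-free sketch: each document becomes the average of its word vectors, giving one fixed-length row per document that a tabular AutoML tool can consume. The two-dimensional `toy_embeddings` below are made up for the example; in practice the vectors would come from a trained model such as Word2Vec, and averaging is only one possible way to aggregate them.

```python
def document_vector(text, embeddings, dim):
    """Average the word vectors of the known words in `text`."""
    tokens = text.lower().split()
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector of the right length.
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Made-up 2-dimensional word vectors, for illustration only.
toy_embeddings = {
    "great": [1.0, 0.0],
    "terrible": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}

documents = ["Great movie", "terrible movie", "unknown words"]
features = [document_vector(doc, toy_embeddings, dim=2) for doc in documents]

for doc, row in zip(documents, features):
    print(f"{doc!r} -> {row}")
# 'Great movie' -> [0.5, 0.5]
# 'terrible movie' -> [-0.5, 0.5]
# 'unknown words' -> [0.0, 0.0]
```

Each row of `features` can then be joined with a label column and passed to a tabular AutoML tool like any other numeric dataset.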


There are many tools out there, and each is typically designed with a specific use-case or set of use-cases in mind. There are many factors to consider, including data preparation, data privacy, cost, modeling options (types of algorithms inside), deployment options and ease-of-use. We hope that you’ll give H2O AutoML and Driverless AI a try!

About the Author

Erin LeDell

Erin is the Chief Machine Learning Scientist at H2O.ai. Erin has a Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on automatic machine learning, ensemble machine learning and statistical computing. She also holds a B.S. and M.A. in Mathematics. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE Digital in 2016) and Marvin Mobile Security (acquired by Veracode in 2012), and the founder of DataScientific, Inc.
