August 15th, 2018
The different flavors of AutoML
Category: AutoML, Data Science, Driverless AI, H2O
By: Erin LeDell
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software (e.g. H2O, scikit-learn, keras). Although these tools have made it easy to train and evaluate machine learning models, a good amount of data science knowledge is still required to create the highest-quality model for a given dataset. Writing the code to perform a hyperparameter search over many different types of algorithms can also be time-consuming and repetitive work.
What is AutoML?
The term “AutoML” (Automatic Machine Learning) refers to automated methods for model selection and/or hyperparameter optimization. AutoML is also a subfield of machine learning that has a rich academic history, an annual workshop at the International Conference on Machine Learning (ICML), and academic research labs devoted to this topic (e.g. University of Freiburg Machine Learning Lab in Germany).
The AutoML field began by developing methods for automating hyperparameter optimization in single models, and now includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering.
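To make "model selection and/or hyperparameter optimization" concrete, here is a minimal sketch of a random search over two toy model families, written in plain Python. The dataset, the two model families (a k-nearest-neighbours vote and a one-feature threshold rule), the search space and all function names are assumptions made for illustration; real AutoML tools use far more capable learners and search strategies.

```python
import random

def make_data(n, rng):
    """Two noisy Gaussian blobs in 2-D, labelled 0 and 1 (synthetic toy data)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.5 * label
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (k is odd, so no ties)."""
    nearest = sorted(
        train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def stump_predict(train, x, feature, threshold):
    """Predict 1 when the chosen feature exceeds the threshold (train is unused)."""
    return 1 if x[feature] > threshold else 0

# Each model family has its own hyperparameter sampler.
SEARCH_SPACE = {
    "knn": lambda rng: {"k": rng.choice([1, 3, 5, 7, 9, 11])},
    "stump": lambda rng: {
        "feature": rng.choice([0, 1]),
        "threshold": rng.uniform(-1.0, 2.5),
    },
}
PREDICTORS = {"knn": knn_predict, "stump": stump_predict}

def accuracy(family, params, train, valid):
    """Fraction of validation rows predicted correctly."""
    predict = PREDICTORS[family]
    hits = sum(predict(train, x, **params) == y for x, y in valid)
    return hits / len(valid)

def random_search(train, valid, n_trials, rng):
    """Sample (family, hyperparameters) pairs and rank them by validation accuracy."""
    leaderboard = []
    for _ in range(n_trials):
        family = rng.choice(sorted(SEARCH_SPACE))
        params = SEARCH_SPACE[family](rng)
        leaderboard.append((accuracy(family, params, train, valid), family, params))
    leaderboard.sort(key=lambda row: row[0], reverse=True)
    return leaderboard

rng = random.Random(0)
train, valid = make_data(200, rng), make_data(100, rng)
leaderboard = random_search(train, valid, n_trials=20, rng=rng)
best_score, best_family, best_params = leaderboard[0]
print(best_family, best_params, round(best_score, 3))
```

The search treats the choice of algorithm as just another hyperparameter, which is why a single loop can cover both model selection and tuning.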
The goal of AutoML software is two-fold:
- To enable non-experts to train high quality machine learning models.
- To improve the efficiency of finding optimal solutions to machine learning problems.
There are a handful of different AutoML platforms (open source, closed source and SaaS), aiming to solve different types of supervised machine learning problems. AutoML tools can largely be categorized by use case or, more simply, by the format of the training data:
- IID tabular data (numeric and/or categorical data)
- Time-series tabular data (numeric and/or categorical data with a time-dependency)
- Raw text data (text classification)
- Raw image data (image classification)
Although some tools handle multiple domains, the majority of AutoML tools are designed for the most common use case, which is IID tabular data (a table with rows and columns). In open source, that would include tools such as H2O AutoML, auto-sklearn (along with its predecessor, Auto-WEKA) and TPOT. H2O.ai’s Driverless AI is a platform that’s geared towards IID tabular data, but also supports time-series data and raw text. In the case of Driverless AI, automatic feature generation is also part of the AutoML process (and one of the key differentiators between open source H2O AutoML and Driverless AI).
Since there is no single algorithm that consistently performs the best across all datasets (a consequence of the “No Free Lunch Theorem”), these AutoML tools explore a variety of algorithms such as Gradient Boosting Machines (GBMs), Random Forests, GLMs, and in some cases also consider Deep Neural Networks. Another approach shared by most of these tools is ensembling several models together to get a stronger final model, a technique behind many winning Kaggle competition entries.
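As a rough illustration of why ensembling helps, the sketch below averages predicted probabilities from three models and compares Brier scores (mean squared error on probabilities). The labels and probabilities are made-up numbers, and plain averaging stands in here for stacking, which additionally learns how to weight the base models.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def average_ensemble(model_probs):
    """Average the per-row probabilities produced by several models."""
    n_models = len(model_probs)
    return [sum(row) / n_models for row in zip(*model_probs)]

# Hypothetical validation labels and predicted probabilities from three models.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_probs = [
    [0.9, 0.4, 0.6, 0.3, 0.2, 0.5, 0.8, 0.1],  # e.g. a GBM
    [0.6, 0.1, 0.9, 0.7, 0.6, 0.2, 0.4, 0.3],  # e.g. a Random Forest
    [0.7, 0.3, 0.5, 0.8, 0.1, 0.4, 0.9, 0.6],  # e.g. a GLM
]

individual = [brier_score(probs, labels) for probs in model_probs]
ensemble = brier_score(average_ensemble(model_probs), labels)

print("individual:", [round(score, 4) for score in individual])
print("ensemble:  ", round(ensemble, 4))
```

Because squared error is convex, the averaged prediction can never score worse than the mean of the individual scores. With these particular numbers it also beats every individual model, but that stronger result depends on the models making different mistakes and is not guaranteed in general.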
Deep Learning: Neural Architecture Search
On the other end of the spectrum is “neural architecture search” (NAS), an AutoML method most commonly used for image classification problems. When training a Deep Neural Network, there are many hyperparameters to tune (e.g. learning rate, batch size, dropout rate); however, one of the biggest contributors to model performance is the network architecture. Two areas where Deep Neural Networks have improved performance over traditional machine learning methods are image classification and natural language processing (NLP).
There are a handful of open source tools for neural architecture search, including TensorFlow and PyTorch implementations of Efficient Neural Architecture Search (ENAS), and Auto-Keras, which performs an efficient NAS using Bayesian hyperparameter optimization. On the SaaS side, Google Cloud offers Cloud AutoML Vision and Cloud Natural Language, which perform a neural architecture search for image and text classification, respectively. These tools require raw image or text data directly, and are therefore not appropriate for typical numeric or categorical tabular data.
How can AutoML Help You?
If you’re part of the majority of data scientists who work with tabular or “relational” data (tables with numeric and/or categorical columns), then H2O AutoML or Driverless AI are great tools to use.
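For tabular data, a minimal H2O AutoML run in Python might look like the sketch below. The function name, the default of 10 models, and the file and column names in the example call are assumptions for illustration, and the sketch assumes a classification target. Running it requires the h2o Python package and Java.

```python
def run_h2o_automl(csv_path, target, max_models=10, seed=1):
    """Train H2O AutoML on a CSV file and return the AutoML object.

    The imports live inside the function so that merely defining it does not
    require the h2o package or a running H2O cluster.
    """
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster
    frame = h2o.import_file(csv_path)

    # Assumption: a classification problem, so the target is converted to a factor.
    frame[target] = frame[target].asfactor()
    predictors = [col for col in frame.columns if col != target]

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=predictors, y=target, training_frame=frame)
    return aml

# Example call (hypothetical file and column names):
# aml = run_h2o_automl("train.csv", target="label")
# print(aml.leaderboard)   # table of trained models ranked by performance
# best_model = aml.leader  # the top model on the leaderboard
```

Limiting the run with `max_models` (or a time budget via `max_runtime_secs`) keeps the search bounded, which is usually the first setting worth adjusting.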
If you have image data, Google Cloud AutoML Vision is an option; however, there are several open source tools (listed above) that do the same thing at no cost and allow you to keep your data off the cloud. Another option is to skip the neural architecture search altogether and use a pre-trained image classification model combined with transfer learning, as explained in this article by fast.ai co-founder Rachel Thomas.
In the case of text data, you could use a general-purpose tool such as Driverless AI, or a tool specifically designed for text such as Google Cloud Natural Language. If you prefer an open source solution, you can apply H2O’s Word2Vec algorithm (or any other open source text processing tool) to convert the text into a numeric format that can be used by H2O AutoML (direct support for text data is on the H2O AutoML roadmap).
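The idea of converting text into a numeric format can be sketched in plain Python by averaging word vectors into one fixed-length row per document. The tiny embedding table below is hand-written and purely hypothetical; in practice the vectors would come from a trained Word2Vec model, and this is not H2O's implementation.

```python
# Hypothetical 3-dimensional word vectors; a real table would come from a
# trained Word2Vec model rather than being written by hand.
WORD_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "fast": [0.7, 0.2, 0.1],
    "slow": [-0.6, 0.3, 0.1],
    "broken": [-0.9, 0.0, 0.2],
    "delivery": [0.0, 0.8, 0.3],
}
DIM = 3

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def document_vector(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[tok] for tok in tokenize(text) if tok in WORD_VECTORS]
    if not vectors:
        return [0.0] * DIM  # no known words: fall back to a zero vector
    return [sum(component) / len(vectors) for component in zip(*vectors)]

reviews = [
    "Great fast delivery",
    "Slow delivery and broken item",
    "unrelated words only",
]
feature_rows = [document_vector(review) for review in reviews]

for review, row in zip(reviews, feature_rows):
    print(review, "->", [round(value, 3) for value in row])
```

Each document becomes a row of numeric columns, which is the shape of input that tabular AutoML tools expect.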
There are many tools out there, and each is typically designed with a specific use case or set of use cases in mind. There are many factors to consider, including data preparation, data privacy, cost, modeling options (types of algorithms inside), deployment options and ease of use. We hope that you’ll give H2O AutoML and Driverless AI a try!