Driverless AI Pipeline Tips

By James Medel posted 06-03-2020 13:54

  

Given training data and a target column to predict, H2O Driverless AI produces an end-to-end pipeline tuned for high predictive performance (and/or high interpretability) for general classification and regression tasks. The pipeline has only one purpose: to take a test set, row by row, and turn its feature values into predictions.

A typical pipeline creates dozens or even hundreds of derived features from the user-provided dataset. These transformations are often based on precomputed lookup tables and parameterized mathematical operations that were selected and optimized during training. The pipeline then feeds all of these derived features to one or more machine learning algorithms, such as linear models, deep learning models, or gradient boosting models (and several more derived models). If there are multiple models, their outputs are post-processed to form the final prediction (either probabilities or target values). Structurally, the pipeline is a directed acyclic graph.
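To make the graph structure concrete, here is a minimal sketch of a toy scoring pipeline in plain Python. It is not Driverless AI code: the node names, the lookup table, and the model weights are all hypothetical, chosen only to show how a lookup-table feature and a parameterized math feature can feed two models whose outputs are blended into one prediction.

```python
from graphlib import TopologicalSorter  # Python 3.9+
import math

# Hypothetical lookup table "learned" during training: mean target per city.
CITY_TARGET_MEAN = {"NYC": 0.8, "SF": 0.3}
GLOBAL_MEAN = 0.5  # fallback for categories not seen in training

# Each node maps to (names of its inputs, function of those inputs).
NODES = {
    "city_enc": (["city"], lambda c: CITY_TARGET_MEAN.get(c, GLOBAL_MEAN)),
    "log_income": (["income"], lambda x: math.log1p(x)),
    "model_a": (["city_enc", "log_income"], lambda e, l: 0.6 * e + 0.02 * l),
    "model_b": (["city_enc"], lambda e: e),
    # Post-processing step: blend the two model outputs.
    "prediction": (["model_a", "model_b"], lambda a, b: 0.5 * a + 0.5 * b),
}


def score_row(row):
    """Evaluate the graph for a single row, visiting nodes in dependency order."""
    values = dict(row)  # raw feature values are the graph's source nodes
    graph = {name: set(inputs) for name, (inputs, _) in NODES.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in NODES:  # raw features have nothing to compute
            inputs, fn = NODES[name]
            values[name] = fn(*(values[i] for i in inputs))
    return values["prediction"]
```

Calling `score_row({"city": "NYC", "income": 0.0})` walks the graph once for that row. Because the graph has no cycles, a topological order always exists, so every node sees its inputs before it runs.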

Note that during training, the dataset is processed as a whole, which gives better results (for example, by allowing aggregate statistics to be computed over all rows). For scoring, however, every row of the test dataset must be processed independently to mimic the actual production scenario.
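The difference between the two phases can be sketched as follows. This is an illustrative simplification with hypothetical function and column names, not the encoding scheme Driverless AI actually uses: the fit step reads every training row to build a lookup table, while the scoring step looks at one row and the frozen table only.

```python
from collections import defaultdict


def fit_target_encoding(train_rows, column, target):
    """Training step: uses every row to build a per-category mean lookup."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in train_rows:
        sums[row[column]] += row[target]
        counts[row[column]] += 1
    total = sum(counts.values())
    return {
        "table": {k: sums[k] / counts[k] for k in sums},
        "default": sum(sums.values()) / total,  # used for unseen categories
    }


def transform_row(encoding, row, column):
    """Scoring step: depends only on this row and the frozen lookup table."""
    return encoding["table"].get(row[column], encoding["default"])


train = [
    {"city": "NYC", "y": 1},
    {"city": "NYC", "y": 0},
    {"city": "SF", "y": 1},
]
enc = fit_target_encoding(train, "city", "y")

test = [{"city": "SF"}, {"city": "LA"}, {"city": "NYC"}]
batch = [transform_row(enc, r, "city") for r in test]
reverse = [transform_row(enc, r, "city") for r in reversed(test)]
```

Since `transform_row` never looks at any other test row, scoring the rows in a different order (or one at a time) yields the same value for each row, which is the property a production scorer needs.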

To facilitate deployment to various production environments, there are multiple ways to obtain predictions from a completed Driverless AI experiment: from the GUI, from the R or Python client API, or from a standalone pipeline.

GUI

  • Score on Another Dataset - Convenient, parallelized, ideal for imported data

  • Download Predictions - Available if a test set was provided during training

  • Deploy - Creates an AWS Lambda endpoint (more endpoints coming soon)

  • Diagnostics - Useful if the test set includes a target column

Client APIs

  • Python client - Use the make_prediction_sync() method. Pass the optional argument pred_contribs=True to get per-row, per-feature Shapley prediction contributions.

  • R client - Use the predict() method. Pass the optional argument pred_contribs=True to get per-row, per-feature Shapley prediction contributions.
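For readers unfamiliar with Shapley contributions, the sketch below shows what a per-row, per-feature breakdown means. It computes exact Shapley values for a tiny hypothetical model by averaging over all feature orderings. This is not how Driverless AI computes its contributions, and the model, the baseline row, and the "bias" entry are assumptions made for illustration only.

```python
from itertools import permutations


def shapley_contributions(predict, row, baseline):
    """Exact Shapley values for one row, averaged over all feature orderings.

    Features not yet "switched on" keep their baseline value.
    """
    features = list(row)
    contribs = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        prev = predict(current)
        for f in order:
            current[f] = row[f]  # switch feature f to its actual value
            new = predict(current)
            contribs[f] += (new - prev) / len(orderings)
            prev = new
    contribs["bias"] = predict(baseline)  # prediction with no feature switched on
    return contribs


def toy_model(x):
    # Hypothetical model with one additive term and one interaction term.
    return 2.0 * x["age"] + x["income"] * x["tenure"]


row = {"age": 3.0, "income": 2.0, "tenure": 4.0}
baseline = {"age": 1.0, "income": 0.0, "tenure": 0.0}
contribs = shapley_contributions(toy_model, row, baseline)
```

The per-feature contributions plus the bias term add up to the model's prediction for the row, and the interaction between `income` and `tenure` is split evenly between those two features. Enumerating every ordering is only practical for a handful of features.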

Standalone Pipelines

  • Python - Supports all models and transformers, and supports ‘Shapley’ prediction contributions and MLI reason codes

  • Java - Most portable, low latency; supports all models and transformers that are enabled by default (except TensorFlow NLP transformers); can be used in Spark, H2O-3, or Sparkling Water for scale

  • C++ - Highly portable, low latency, standalone runtime with convenient Python and R wrappers


#driverless-ai