By: Michal Kurka
There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we’ve introduced cross-version support for model import, added new features for model interpretation, provided much-improved support for reading data from Apache Hive, and included various algorithm and AutoML improvements. The release is named after Frank Yates, one of the pioneers of 20th-century statistics.
Native Hive Import
We now support direct import of Hive tables. The new h2o.import_hive_table function (available in R and Python) reads table metadata from the Hive Metastore and then reads table data directly from HDFS. This provides much faster data access than the existing JDBC support and is more convenient than reading data directly from HDFS via
Documentation for the Hive Import: link
Negative Binomial Support in GLM
We have added negative binomial regression to model count variables. For example, the count variables can be the number of days of student absences per month or the number of hospital visits in the past 12 months.
Negative binomial regression is a generalization of Poisson regression that loosens the restrictive assumption that the variance is equal to the mean. Instead, the variance of the negative binomial is a function of its mean and a parameter theta, the dispersion factor. The current version of negative binomial regression uses the penalized negative log likelihood to find the coefficients given a fixed theta. Users can also use a grid search method to find an optimal theta. Future releases will further simplify this process and optimize the coefficients as well as the dispersion parameter.
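To make the role of theta concrete, here is a minimal sketch, not H2O's implementation. It assumes the common NB2 parameterization, in which the variance is mu + theta * mu^2 (so theta = 0 gives back Poisson), fits an intercept-only model, and uses the plain (unpenalized) negative log-likelihood. The function names and the sample counts are made up for the illustration.

```python
import math

def nb_variance(mu, theta):
    """Variance under the assumed NB2 form: mean plus theta times mean squared."""
    return mu + theta * mu * mu

def nb_neg_log_likelihood(counts, mu, theta):
    """Negative log-likelihood of counts for mean mu and a fixed theta > 0."""
    r = 1.0 / theta  # "size" parameter of the negative binomial
    nll = 0.0
    for y in counts:
        log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
                   + r * math.log(r / (r + mu))
                   + y * math.log(mu / (r + mu)))
        nll -= log_pmf
    return nll

def grid_search_theta(counts, thetas):
    """Pick the theta from a grid that minimizes the negative log-likelihood."""
    mu = sum(counts) / len(counts)  # intercept-only fit: the mean is the sample mean
    return min(thetas, key=lambda t: nb_neg_log_likelihood(counts, mu, t))

# Hypothetical, clearly overdispersed counts (variance far above the mean).
visits = [0, 0, 1, 1, 2, 3, 5, 8, 12, 20]
best_theta = grid_search_theta(visits, [0.01, 0.1, 0.5, 1.0, 2.0])
```

In a real regression the mean depends on the predictors and the coefficients are re-estimated for every candidate theta; the grid search over theta works the same way.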
Documentation for Negative Binomial Regression: link
MOJO Import
In H2O terminology, a MOJO is a model trained in H2O itself but packaged in such a way that it is self-contained and easily deployable in Java environments. Starting with this release, a selected subset of MOJOs can be imported into H2O and used for scoring. Basic information about the model represented by a MOJO can also be displayed, although the amount of detail depends on the model type and may be limited to basic metrics. H2O has no control over such models, so they are treated as generic, hence the name Generic models. MOJO models are designed to be forward compatible, which gives you the option to load a model trained in an older version of H2O into a newer version.
With this release, H2O supports import of selected MOJOs. The following MOJOs based on decision trees are supported: Gradient Boosting Machines, Distributed Random Forest, and Isolation Forest. Generalized Linear Model and K-Means cluster analysis MOJOs can be imported as well. An existing MOJO can be imported into H2O as a generic model using a single command. For more information, please visit the Generic Models section of the H2O documentation.
Model Reproducibility Enhancements
Being able to reproduce a trained model is important for users who operate in regulated industries. In this release, we have improved our documentation and we are providing detailed guidelines and best practices for building reproducible models and reproducing already existing models.
For users who need reproducible models, we fixed a bug in the R client that prevented it from retrieving the seed correctly. When a response from the H2O server is parsed, character values are converted to numeric values. In R, long-type numbers are automatically converted to double precision, so numbers outside the range (-2^53, 2^53) cannot always be represented exactly. The seed is a long, so large seeds were displayed incorrectly in the R client after the JSON response was parsed and, as a result, could not be used to reproduce the model. Auto-generated seeds from the H2O server can take any value in the range of the long type (for example, 8664354335142703762). Now, all big numbers are parsed as characters in the R client, so the seed can be given and returned as either a numeric value or a string. A user can set a new option, h2o.warning.on.json.string.conversion, to see a warning whenever a big number is converted to characters while a response is parsed.
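The precision problem is not specific to R; any client that stores a 64-bit integer in a double runs into it. The following sketch reproduces the effect in Python with the example seed from above. It only illustrates the failure mode and the string-based workaround; it is not the R client's code, and the JSON payload is a made-up stand-in for a server response.

```python
import json

seed = 8664354335142703762  # example seed quoted in the text above

# A made-up response fragment; the real server response has many more fields.
response = '{"seed": 8664354335142703762}'

# Parsing the integer as a double (what a double-only numeric type does) loses digits.
lossy = json.loads(response, parse_int=float)["seed"]

# Parsing the integer as a string keeps every digit.
exact = json.loads(response, parse_int=str)["seed"]

print(int(lossy) == seed)   # False: the double cannot hold this value exactly
print(int(exact) == seed)   # True: the string round-trips without loss
```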
Documentation for Reproducibility: link
TreeSHAP
Gradient boosting machines (implemented in H2O by GBM and XGBoost) are some of the most popular machine learning algorithms because they usually perform very well on traditional transactional data. However, it is becoming increasingly important not just to build models with good predictive power but also to be able to interpret the predictions of these complex models. This release of H2O brings support for calculating SHAP (SHapley Additive exPlanation) values. SHAP values are consistent and locally accurate feature attribution values that a data scientist can use to understand and interpret the predictions of the model. H2O users can use a new function, predict_contributions, to explain the predictions of their GBM and XGBoost models and visualize the explanations with the help of third-party packages. Users can also calculate the contributions for models deployed in their production environment, because this feature is included in our model deployment package (MOJO) as well.
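"Locally accurate" means that the per-feature contributions for a row, plus a bias term, add up to the model's prediction for that row. The sketch below shows this property with a brute-force Shapley computation on a tiny hand-written model. It is not the TreeSHAP algorithm and does not call H2O; toy_model, the all-zeros baseline, and the input row are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every feature subset (tiny inputs only)."""
    n = len(x)

    def value(subset):
        # Features in the subset take the row's value, the rest take the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_model(row):
    # Hypothetical stand-in for a tree model: one interaction plus one additive term.
    bonus = 10.0 if row[0] > 0.5 and row[1] > 0.5 else 0.0
    return bonus + 2.0 * row[2]

x = [1.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
bias = toy_model(baseline)

# Local accuracy: contributions plus the bias term reproduce the prediction.
print(phi, bias, toy_model(x))
```

The two interacting features split the interaction effect evenly, and the additive feature receives exactly its own effect. Tree-specific algorithms reach the same kind of attribution without enumerating all subsets.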
Example of H2O GBM to generate SHAP values: link
Blending in Stacked Ensembles
We have a new type of stacking available in H2O-3 – blending. Rather than training the metalearner (combiner) algorithm for the Stacked Ensemble using a frame of cross-validated predicted values of the base models, “blending” relies on fitting the metalearner using the predicted values for a user-specified holdout frame. Because this method does not rely on cross-validating the base models, it can save a lot of time (though this can slightly reduce model performance for the ensemble). In addition to the faster training time, one of the prominent use cases for blending is time series data. If you have data with a distributional time-shift, you should not be using k-fold cross-validation, and thus, traditional stacking that relies on cross-validation does not work well in these cases. However, if you partition your time-series data into training, blending, and test frames (the blending and test frames come after the training set in time), then you can use blending to create ensemble models.
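The time-series recipe above can be sketched in a few lines of plain Python. This is a simplified illustration, not the H2O implementation: the split percentages, the row layout (dictionaries with a "t" timestamp key), and the metalearner (a single weight that mixes two base models' holdout predictions by least squares) are all assumptions made for the example.

```python
def time_ordered_split(rows, train_pct=60, blend_pct=20):
    """Split rows into training, blending, and test parts in time order."""
    ordered = sorted(rows, key=lambda r: r["t"])
    n_train = len(ordered) * train_pct // 100
    n_blend = len(ordered) * blend_pct // 100
    train = ordered[:n_train]
    blend = ordered[n_train:n_train + n_blend]  # strictly after the training rows
    test = ordered[n_train + n_blend:]          # strictly after the blending rows
    return train, blend, test

def fit_blend_weight(y, pred_a, pred_b):
    """Least-squares weight w for w * pred_a + (1 - w) * pred_b, clipped to [0, 1].

    y, pred_a, and pred_b come from the blending (holdout) frame, so the base
    models never saw these rows and no cross-validation is needed.
    """
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, pred_a, pred_b))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    w = num / den if den else 0.5  # identical predictions: any weight works
    return min(1.0, max(0.0, w))

rows = [{"t": t} for t in [5, 2, 9, 0, 7, 1, 8, 3, 6, 4]]
train, blend, test = time_ordered_split(rows)
```

The important part is the ordering: every blending row comes after every training row, and every test row comes after every blending row, so the metalearner is never fit on information from the future.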
This new option has also been integrated into our AutoML system. By default, AutoML uses traditional stacking (using cross-validation); however, if you specify a blending_frame (a piece of holdout training data), the Stacked Ensembles in AutoML are trained using blending instead. If you do this, you also have the option to turn off cross-validation completely (nfolds=0) in AutoML. This increases speed, so more models are trained in the same amount of time (and, in some cases, this can lead to better final ensemble models because there are more base models to learn from). If you turn off cross-validation, remember that you should specify a frame to score the leaderboard using the
AutoML Improvements
The latest release of AutoML includes a number of bug fixes as well as some usability and performance improvements. As mentioned above in the Stacked Ensembles section, the ensembles created in the AutoML process can now be trained without cross-validation via the blending_frame argument. This means that more models can be trained in the same amount of time (around 5x as many with the default settings), and it also provides the opportunity to use AutoML on (multi-predictor) time-series data.
To improve usability, we have added a max_runtime_secs_per_model argument, which is disabled by default. This allows users to set a limit on the amount of time spent on a single model, which can be useful for certain datasets (e.g., tree-based models may take too much time on wide datasets). In addition to the exclude_algos parameter, we now have an include_algos parameter. (Use whichever is more convenient, but not both at the same time.)
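As a rough sketch of how these options fit together, the hypothetical helper below assembles a dictionary of AutoML keyword arguments and rejects the one combination the text rules out. Only the argument names (max_runtime_secs_per_model, include_algos, exclude_algos, nfolds) come from this post; the helper itself, the default of 5 folds, and the use of 0 to mean "no per-model limit" are assumptions made for the example.

```python
def automl_settings(max_runtime_secs_per_model=0, include_algos=None,
                    exclude_algos=None, nfolds=5):
    """Build keyword arguments for an AutoML run (hypothetical helper)."""
    if include_algos is not None and exclude_algos is not None:
        raise ValueError("Use include_algos or exclude_algos, not both.")
    settings = {
        "nfolds": nfolds,  # 0 turns cross-validation off (pair it with a blending frame)
        "max_runtime_secs_per_model": max_runtime_secs_per_model,  # assumed: 0 = no limit
    }
    if include_algos is not None:
        settings["include_algos"] = list(include_algos)
    if exclude_algos is not None:
        settings["exclude_algos"] = list(exclude_algos)
    return settings

# Example: cap each model at two minutes and only build XGBoost and GBM models.
settings = automl_settings(max_runtime_secs_per_model=120,
                           include_algos=["XGBoost", "GBM"],
                           nfolds=0)
```

A dictionary like this could then be unpacked into the AutoML constructor of the Python client; check the AutoML documentation for the exact argument list in your version.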
Lastly, we re-ordered the sequence of algorithms within AutoML so that our curated set of XGBoost models comes first. Previously, we started with Random Forest models, but for cases where a user runs AutoML for only a single model, we wanted to make sure we start with XGBoost, which is often the strongest algorithm in the sequence. The full sequence of algorithms is documented in the AutoML FAQ.
This new H2O release is brought to you by Jan Sterba (Native Hive Import), Wendy Wong (Negative Binomial Support in GLM), Pavel Pscheidl (MOJO Import), Veronika Maurerova (Model Reproducibility Enhancements), Michal Kurka (TreeSHAP), Sebastien Poirier (Blending in Stacked Ensembles), and Erin LeDell (AutoML Improvements).