By: Michal Kurka
There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we’ve added Extended Isolation Forest for improved results on anomaly detection problems, and we’ve added the Type III SS test (ANOVAGLM) and the MAXR method to GLM. For existing algorithms, we improved the performance of our GBM and DRF models in a cross-validation setting with early stopping, added permutation feature importance for black-box models, and improved an algorithm introduced in the previous major release (GAM). On the technical side, we improved the training speed of our GBM algorithm, lowered the memory usage of cross-validation, and enabled H2O to run on the latest version of Java. This release is named after Vaclav Zizler. You can download this release from our website.
New Algorithm: Extended Isolation Forest
Extended Isolation Forest (EIF) is an algorithm for unsupervised anomaly detection based on the Isolation Forest algorithm. The original Isolation Forest algorithm introduced a new approach to anomaly detection, but it suffers from a bias caused by the way its trees branch. The extension mitigates this bias by adjusting the branching, and the original algorithm becomes just a special case of the extended one.
Figure: on the left, Extended Isolation Forest; on the right, Isolation Forest, simulated by Extended Isolation Forest with extension_level=0.
The demo we provide in this Jupyter notebook is an excellent place to start and to gain more information.
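To make the "special case" remark concrete, here is a minimal sketch of a single branching step. It is an illustration written for this post, not H2O's implementation: the function name random_split and its arguments are hypothetical, and it assumes the usual EIF convention that an extension level e leaves e + 1 coordinates of the split's normal vector non-zero.

```python
import random


def random_split(points, extension_level, rng):
    """Split points with one random hyperplane (illustrative sketch only).

    Assumption: extension_level ranges from 0 to dim - 1, and
    dim - extension_level - 1 coordinates of the normal vector are zeroed.
    """
    dim = len(points[0])
    # Random direction of the hyperplane's normal vector.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Zero out coordinates; with extension_level=0 only one coordinate
    # survives, which is the axis-parallel cut of the original algorithm.
    for i in rng.sample(range(dim), dim - extension_level - 1):
        normal[i] = 0.0
    # Random point inside the bounding box of the data.
    intercept = [
        rng.uniform(min(p[i] for p in points), max(p[i] for p in points))
        for i in range(dim)
    ]
    left, right = [], []
    for p in points:
        side = sum((p[i] - intercept[i]) * normal[i] for i in range(dim))
        (left if side <= 0 else right).append(p)
    return normal, intercept, left, right


rng = random.Random(42)
data = [[rng.uniform(0, 1) for _ in range(3)] for _ in range(50)]
axis_parallel_normal = random_split(data, 0, rng)[0]
fully_extended_normal = random_split(data, 2, rng)[0]
```

With extension_level=0 every cut is parallel to one axis, which is where the branching bias of the original algorithm comes from. Higher levels allow slanted cuts.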
Type III SS Test (ANOVAGLM), Maximum R Square Improvement to GLM (MaxRGLM)
We are excited to introduce two new algorithms to our GLM toolbox: the Type III SS Test (ANOVAGLM) and Maximum R Square Improvement to GLM (MaxRGLM). Here is a summary of what the new algorithms do:
H2O ANOVAGLM is used to calculate the Type III SS Test, which investigates the contributions of individual predictors and their interactions to a model. Predictors or interactions with negligible contributions to the model will have high p-values, while those that contribute more will have low p-values. H2O ANOVAGLM supports all families. In addition, interactions can be between 2, 3, … and n predictors. Cross-validation, validation datasets, and MOJO are not supported.
H2O Maximum R Square Improvement to GLM (MaxRGLM) finds the best one-variable model, the best two-variable model, … and up to the best n-variable model. Unlike conventional Maximum R Square Improvement models, H2O MaxRGLM is guaranteed to find the model with the largest R² for each variable subset size. H2O MaxRGLM works only with regression, and it can be computationally intensive to use. Cross-validation and validation datasets are supported. However, MOJO is not supported.
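To illustrate what "the model with the largest R² for each subset size" means, here is a brute-force sketch in plain Python. It is not how H2O MaxRGLM is implemented; it simply enumerates every predictor subset and fits ordinary least squares, which also shows why the problem is computationally intensive. All function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def r_squared(rows, y, cols):
    """R² of a least-squares fit of y on an intercept plus the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * t for x, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    predictions = [sum(b * v for b, v in zip(beta, x)) for x in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p_) ** 2 for t, p_ in zip(y, predictions))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def best_subsets(rows, y):
    """For each subset size k, return the column indices with the largest R²."""
    n = len(rows[0])
    return {
        k: max(combinations(range(n), k), key=lambda cols: r_squared(rows, y, cols))
        for k in range(1, n + 1)
    }


# Toy data: the response depends only on columns 0 and 2.
rows = [[1, 2, 5], [2, 1, 3], [3, 4, 8], [4, 3, 1], [5, 6, 9], [6, 5, 2]]
y = [2 * r[0] + 3 * r[2] for r in rows]
print(best_subsets(rows, y))
```

The number of candidate subsets grows exponentially with the number of predictors, so exhaustive enumeration like this is only practical for small problems.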
Permutation Feature Importance
H2O-3 now features a new functionality to measure variable importance: Permutation Variable Importance. This is supported by all supervised models, and for some models (e.g. Stacked Ensemble models) it’s the only supported method.
The Permutation Variable Importance of each variable is obtained by measuring the change of a metric when that variable is permuted. This allows you to look at how variable importance depends on any supported metric.
Permutation Variable Importance is added in two methods/functions: one that produces a numerical result (i.e. a data frame) and one that plots the variable importance. Since Permutation Variable Importance results depend on the actual permutation, you can specify a number of repeated runs, and in that case, the output is a box plot as opposed to the default bar plot.
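The idea can be sketched in a few lines of plain Python. This is a conceptual illustration, not the H2O API: the names mse and permutation_importance and the n_repeats argument are hypothetical, and mean squared error stands in for "any supported metric".

```python
import random


def mse(model, rows, y):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(model, rows, y, n_repeats=5, seed=0):
    """Return, per column, the metric change for each repeated permutation."""
    rng = random.Random(seed)
    baseline = mse(model, rows, y)
    importance = {}
    for col in range(len(rows[0])):
        deltas = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving all other columns intact.
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            deltas.append(mse(model, permuted, y) - baseline)
        importance[col] = deltas
    return importance


# Toy example: the model only uses column 0, so column 1 has no importance.
rows = [[float(i), float(i % 3)] for i in range(20)]
y = [3 * r[0] for r in rows]
result = permutation_importance(lambda r: 3 * r[0], rows, y)
```

Each column gets one value per repeated run, which is why repeated runs are naturally summarized with a box plot rather than a single bar.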
There are also several minor bug fixes and usability improvements in the toolkit for Model Explainability. The most significant is being able to provide a leaderboard frame to the explain function instead of the H2OAutoML object. This allows you to explicitly sort the explained models by any metric/column in the (extended) leaderboard (even columns like predict_time_per_row_ms) and run explain for any subset of the models more easily.
This release of H2O brings technical improvements to our popular tree-based algorithms: GBM and DRF. We optimized the histogram building procedure that is crucial for the fast performance of GBM/DRF. Thanks to these improvements, we observed a speed-up of up to 30% for model training times (depending on datasets). Users should also see a slight improvement in the runtime performance of cross-validation thanks to optimized access to internal CV weights. On top of these changes, we also optimized the way holdout predictions are represented in CV models. This change resulted in lower memory requirements to store the holdout frame. We also fine-tuned the method H2O uses to pick the optimal number of trees when early stopping is used to train CV models. With this change, users should see a slight increase in accuracy and be able to build models that generalize better.
A long-standing issue affecting predominantly our community users was a strict requirement on specific Java versions. H2O has had a strict process of certifying each release for compatibility with new Java versions. This, however, presented a challenge in releasing H2O versions for the newest Java versions. To address this issue, we enabled H2O to run on a Java version that is not officially certified, as long as the JVM passes a compatibility check we built into H2O itself. This means that the vast majority of users can now start using the latest (and also future) Java versions.
The H2O AutoML algorithm went through some positive restructuring and a re-prioritization of tasks that we have found leads to performance gains on a majority of datasets.
We introduced a new concept of “model groups” inside the algorithm and priority levels to each of these groups. We are currently favoring a combination of fast and high performing models at the beginning of the run, with more random searching in the later stages of the run.
Another major change was to use as much of the time given by the user as possible. In previous versions, the user would specify a maximum runtime, and we employed early stopping to stop AutoML if we detected flattening/declining improvements over time. Now, we re-start the two most promising random grids to fully utilize the “extra” time. In most cases, running the whole time (vs. early-stopped version) ultimately leads to better final models. Consequently, the user has more direct control over how much time they want to use in AutoML, and this makes H2O AutoML easier to benchmark as well.
Another major change is the introduction of iterative Stacked Ensembles. Rather than only training two ensembles (i.e. the “All Models” and “Best of Family” ensembles), we train both of these after each model group. This creates a more diverse set of ensembles, including more sparse (fast) ensembles, and also provides the user with ensemble results more quickly, leading to runs that more consistently respect the user-specified runtime limit. Stacked Ensembles can always be turned off, but they are generated automatically by default.
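The interplay of model groups and iterative ensembles can be pictured with a toy sketch. It does not reflect the actual AutoML code: the Model tuple, the scores, and the assumption that a lower priority number runs first are all made up for illustration, and no models are actually trained.

```python
from collections import namedtuple

# Hypothetical record of a trained model; a higher score is assumed to be better.
Model = namedtuple("Model", ["name", "family", "score"])


def best_of_family(models):
    """Keep only the best-scoring model of each algorithm family."""
    best = {}
    for m in models:
        if m.family not in best or m.score > best[m.family].score:
            best[m.family] = m
    return sorted(best.values(), key=lambda m: m.name)


def run_groups(groups):
    """Process groups in priority order and record both ensembles after each."""
    trained, ensembles = [], []
    for priority in sorted(groups):
        trained.extend(groups[priority])
        ensembles.append({
            "after_group": priority,
            "all_models": [m.name for m in trained],
            "best_of_family": [m.name for m in best_of_family(trained)],
        })
    return ensembles


groups = {
    1: [Model("gbm_1", "GBM", 0.80), Model("glm_1", "GLM", 0.70)],
    2: [Model("gbm_2", "GBM", 0.85)],
}
for ensemble in run_groups(groups):
    print(ensemble)
```

Because a pair of ensembles exists after every group, a run that is cut off by the time limit still ends with usable ensembles built from the models finished so far.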
Sparkling Water Improvements
The Sparkling Water pipeline API was extended with four new algorithms. The first two, H2OPCA and H2OGLRM, serve to reduce the number of features so that the subsequent pipeline stages can be trained faster while preserving the overall pipeline performance. The third new algorithm is H2OAutoEncoder. This deep-learning-based algorithm can currently be used for encoding an arbitrary list of features into a vector of numerical values and for anomaly detection problems. It will also offer feature reduction capabilities in the future. Last, but not least, is the introduction of H2ORuleFit. This algorithm combines tree ensembles and linear models to take advantage of both methods: a tree ensemble’s accuracy and a linear model’s interpretability.
The Sparkling Water MOJO model representation, H2OMOJOModel, now offers more comprehensive information about the model performance. Apart from standard performance metrics (e.g. AUC, RMSE, etc.), H2OMOJOModel also provides more complex information (e.g. confusion matrix, gains lift table, values for individual ROC curve thresholds, etc.).
H2OMOJOModel also gives better insight into the performance of individual cross-validation folds. A user can get a quick overview of the fold performance in a table where each row represents a specific metric and each column represents an individual fold. A user can also go further and obtain the cross-validation models. A cross-validation model is a regular H2OMOJOModel and contains all the complex performance information mentioned above. Cross-validation models can also be used for scoring, so users can evaluate their performance in their own way.
We’ve localized the Java requirements for H2O-3 into one useful section with all further references to Java requirements pointing to this section. The link and distribution type equations have also been formally laid out to show how each type is calculated. We also now show how to examine AutoML trained models more closely and have added output example files for Python/R/HTML to the H2O Explain documentation.
In the works for future releases: Uplift
We are working on a new algorithm: Distributed Uplift Random Forest (Uplift DRF). Uplift DRF is a classification tool for modeling uplift, that is, the incremental impact of a treatment. This algorithm is based on uplift trees, so it models uplift directly through the way the trees are built.
Uplift DRF can be applied to fields where we operate with two groups of subjects. The first group, let’s call it treatment, receives a treatment (e.g. marketing campaign, medicine, etc.), and the second group, let’s call it control, does not receive the treatment. We also gather information about their response (e.g. whether they bought a product, recovered from a disease, etc.).
Uplift trees use the treatment/control group assignment and the response directly when deciding how to split a node. The output of the uplift model is the probability of a change in a user’s behavior. This helps to decide whether the treatment impacts a desired behavior (e.g. buying a product, recovering from a disease, etc.), in other words, whether a user responds because the user was treated. This leads to proper campaign targeting of subjects that genuinely need to be treated and avoids wasting resources on subjects that would respond, or not respond, regardless of the treatment.
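As a rough illustration of the quantity being modeled, the observed uplift of a group of subjects can be computed as the difference between the response rate of the treated subjects and that of the control subjects. The sketch below uses hypothetical field names (treated, responded) and is not the Uplift DRF algorithm itself, which builds trees around such differences.

```python
def response_rate(responses):
    """Fraction of subjects who responded (responses are 0 or 1)."""
    return sum(responses) / len(responses)


def observed_uplift(records):
    """Response rate of the treatment group minus that of the control group."""
    treated = [r["responded"] for r in records if r["treated"]]
    control = [r["responded"] for r in records if not r["treated"]]
    return response_rate(treated) - response_rate(control)


# Toy data: 3 of 4 treated subjects responded, 1 of 4 control subjects responded.
records = (
    [{"treated": True, "responded": 1}] * 3
    + [{"treated": True, "responded": 0}]
    + [{"treated": False, "responded": 1}]
    + [{"treated": False, "responded": 0}] * 3
)
print(observed_uplift(records))
```

A positive value suggests the treatment increases the response in that group, a value near zero suggests subjects behave the same with or without it, and a negative value suggests the treatment discourages the response.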
Check out our user guide for more information:
- Isolation Forest and Extended Isolation Forest
- Maximum R Square Improvements (MAXR)
- Permutation Variable Importance
- Gradient Boosting Machine (GBM)
- Distributed Random Forest (DRF)
- Sparkling Water
This new H2O release is brought to you by Achraf Merzouki, Adam Valenta, Erin LeDell, Hannah Tillman, Joseph Granados, Marek Novotný, Michal Kurka, Neema Mashayekhi, Sebastien Poirier, Tomáš Frýda, Veronika Maurerova, Wendy Wong, and Zuzana Olajcova.