By: Michal Kurka
There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we’ve introduced support for Hierarchical GLM, added an option to parallelize Grid Search, upgraded XGBoost with newly added features, and improved our AutoML framework. The release is named after Bin Yu.
We are very excited to add HGLM (Hierarchical GLM) to our open source offering. HGLM fits generalized linear models with random effects, where the random effect can come from a conjugate exponential-family distribution (for example, Gaussian). Because HGLM lets you specify both fixed and random effects, it can fit models with correlated random effects as well as random regression models. HGLM can be used for linear mixed models and for generalized linear mixed models with random effects for a variety of links and a variety of distributions for both the outcomes and the random effects.
You can review the detailed documentation here. This release implements HGLM for the Gaussian family. Stay tuned or, better yet, tell us which distributions you want to see next. Try it out and send us your feedback! We would also like to hear which model metrics you are interested in seeing.
Parallelized Grid Search
This release adds a new way to speed up grid search by training n models in parallel. In both Python and R, a new parameter has been added to determine the level of parallelism used during Grid Search. By default, sequential model building is ensured with parallelism = 1. There are two additional ways to set the level of parallelism:
- Manual setting of the number of models built in parallel: parallelism = n, where n > 1
- H2O heuristics: parallelism = 0
H2O will always attempt to train the number of models determined by the parallelism argument simultaneously. Once a model is finished, it is added to the grid, and another one is started immediately.
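The scheduling behaviour described above can be pictured with a small standard-library sketch. This is not H2O code: train_model and run_grid are hypothetical stand-ins, the "score" is made up, and the parallelism = 0 heuristic is not modeled. It only shows the idea of keeping a fixed number of models in flight and adding each one to the grid as it finishes.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor, as_completed


def train_model(params):
    """Hypothetical stand-in for training one model; returns a made-up score."""
    return {"params": params, "score": params["max_depth"] * params["learn_rate"]}


def run_grid(hyper_params, parallelism=1):
    """Train every hyperparameter combination, at most `parallelism` at a time."""
    if parallelism < 1:
        raise ValueError("this sketch only models parallelism >= 1")
    names = sorted(hyper_params)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(hyper_params[n] for n in names))]
    grid = []
    # The pool keeps up to `parallelism` workers busy: as soon as one model
    # finishes, the next queued combination starts.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(train_model, combo) for combo in combos]
        for future in as_completed(futures):
            grid.append(future.result())  # finished models join the grid
    return grid


grid = run_grid({"max_depth": [3, 5], "learn_rate": [0.1, 0.2, 0.3]}, parallelism=2)
print(len(grid), "models trained")
```

With parallelism=1 the same function degenerates to sequential building, which mirrors the default described above. The order in which models land in the grid depends on completion order, so it is not guaranteed to match the order of the hyperparameter combinations.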
XGBoost upgrade and new features (checkpointing, Platt Scaling)
We’ve upgraded the XGBoost library to version 0.90, which brings a wide range of bug fixes and performance improvements. More details can be found in the XGBoost 0.90 Release Notes. As the XGBoost 1.0 release draws near, we will be ready to integrate it with H2O as soon as possible.
Platt Scaling is now available for XGBoost. You can now use the calibrate_model and calibration_frame parameters when training an XGBoost model.
XGBoost now supports resuming from a trained model (checkpointing) as well.
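For readers unfamiliar with Platt Scaling, the idea is to fit a sigmoid that maps raw model scores to calibrated probabilities, using held-out data (the role a calibration frame plays). The following is a minimal pure-Python sketch of that general technique, not H2O's implementation: fit_platt and calibrate are hypothetical helper names, the data is invented, and plain gradient descent is used for simplicity.

```python
import math


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability with the fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Invented held-out calibration data: raw model scores and true labels.
cal_scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
cal_labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

a, b = fit_platt(cal_scores, cal_labels)
print("calibrated P(y=1 | score=1.0):", round(calibrate(1.0, a, b), 3))
```

The calibration data should be separate from the training data; otherwise the sigmoid is fitted to overconfident scores.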
H2O can now be used on Cloudera Data Platform, an enterprise-ready cloud data environment. In addition to that, we added support for new versions of Hadoop distributions, improved support for Kerberos authentication (SPNEGO) and improved cloud-forming on clusters with restricted network access.
Logging in h2o-genmodel.jar
Logging possibilities for the hex.genmodel.easy.EasyPredictModelWrapper with contributions enabled were extended. H2O now supports the SLF4J logging library, but no SLF4J library is bundled in the H2O module; therefore, you must ensure the library is present on the classpath in order to use this new functionality. When there is no SLF4J library on the classpath, original logging functionality is preserved.
This new release also comes with a set of new features for H2O AutoML.
Monotonic constraints can now be enforced in AutoML. In some cases, such constraints can improve the predictive performance of the model. Among the set of models trained in an AutoML run, only XGBoost and H2O GBM models are able to enforce monotonicity; however, this can also improve the Stacked Ensembles.
We are now providing more details about the models produced by AutoML in an extended version of the leaderboard (thanks to the new get_leaderboard function in the Python and R clients). For now, this includes information like the training time of the model and the average prediction time (per row), but we plan to add more useful model information in future releases. Prediction speed is especially useful to measure when considering which models to deploy to production. The leaderboard now also includes the Area Under the Precision-Recall Curve (AUCPR) as an additional metric for binary classification problems (also available as a new option for sort_metric in AutoML).
To improve AutoML reproducibility across versions, or simply to give you a bit more control over the training pipeline, we now expose a new modeling_plan parameter listing the steps taken into consideration by AutoML. In return, the AutoML object exposes a modeling_steps property, which lists all the steps that were actually used during training. For reproducibility, this list can be fed back into a new AutoML instance by passing its value to the modeling_plan parameter. This also opens the door to various customizations, like the possibility for you to plug in your own steps written in Java, but we’ll tell you more about this very soon.
Finally, you’ll find the usual bundle of bug fixes and minor improvements, among which are an improved AutoML widget in the Flow UI, a more consistent handling of AutoML reruns (when retraining AutoML with the same project name), and a noticeable change in the default value for the max_runtime_secs parameter, whose previous value appeared inconvenient in many cases.
Scikit-learn users will be happy to learn that this release comes with a new integration API that removes most limitations of the legacy support.
sklearn-compatible estimators and transformers are exposed in a new h2o.sklearn module. They accept the same parameters as their twins from the h2o.automl modules, but in addition to H2OFrame, they also accept numpy arrays or pandas dataframes and can, therefore, be combined with usual Scikit-learn transformers in a Pipeline (for example). Also, all the standard sklearn ways of setting or modifying the parameters of those estimators are now supported.
This release adds easy-to-follow code examples to most of the functions in the Python Module documentation. This documentation is available at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html.
What is coming next?
The current K-means algorithm implementation uses the Lloyd iteration principle to determine optimal clusters depending on distances of data from centroids. Within this release, we prepared an improvement to this algorithm and added the possibility to set the minimum number of data points in each cluster.
To satisfy the custom minimal cluster size, the calculation of clusters is converted to the Minimum Cost Flow problem. A graph is constructed based on the distances and constraints. The goal is to go iteratively through the data points represented as input edges of the graph and create an optimal spanning tree that satisfies the constraints. The Minimum Cost Flow problem can be solved efficiently in polynomial time.
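To make the reduction more concrete, here is a small pure-Python sketch of a single constrained assignment step posed as a minimum cost flow. It is not H2O's implementation: the function name, the network layout (source, points, clusters, an overflow node, sink) and the solver (successive shortest paths with Bellman-Ford rather than a spanning-tree based method) are all assumptions chosen for brevity, and integer coordinates are used so that edge costs are exact.

```python
import math


def assign_with_min_size(points, centroids, min_size):
    """Assign each point to a centroid so that every cluster receives at least
    `min_size` points and the total squared distance is minimal."""
    n, k = len(points), len(centroids)
    if k * min_size > n:
        raise ValueError("too few points for the requested minimum cluster size")

    # Node ids: 0 = source, 1..n = points, n+1..n+k = clusters, then overflow, sink.
    source, overflow, sink = 0, n + k + 1, n + k + 2
    n_nodes = n + k + 3
    graph = [[] for _ in range(n_nodes)]  # edge ids leaving each node
    to, cap, cost = [], [], []            # per-edge target, residual capacity, cost

    def add_edge(u, v, capacity, weight):
        # The forward edge gets an even id; its residual twin is id ^ 1.
        graph[u].append(len(to)); to.append(v); cap.append(capacity); cost.append(weight)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-weight)
        return len(to) - 2

    point_edges = []
    for i, p in enumerate(points):
        add_edge(source, 1 + i, 1, 0)  # every point carries one unit of flow
        row = []
        for j, c in enumerate(centroids):
            d2 = sum((a - b) ** 2 for a, b in zip(p, c))
            row.append(add_edge(1 + i, 1 + n + j, 1, d2))
        point_edges.append(row)
    for j in range(k):
        add_edge(1 + n + j, sink, min_size, 0)  # reserved slots: the minimum size
        add_edge(1 + n + j, overflow, n, 0)     # any additional points
    add_edge(overflow, sink, n - k * min_size, 0)

    for _unit in range(n):  # one cheapest augmenting path per point
        dist = [math.inf] * n_nodes
        prev_edge = [-1] * n_nodes
        dist[source] = 0
        for _pass in range(n_nodes - 1):  # Bellman-Ford handles negative residual costs
            changed = False
            for u in range(n_nodes):
                if dist[u] == math.inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
                        changed = True
            if not changed:
                break
        v = sink
        while v != source:  # push one unit back along the path
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    # A saturated point-to-cluster edge marks the chosen cluster.
    return [next(j for j in range(k) if cap[point_edges[i][j]] == 0) for i in range(n)]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
centroids = [(1, 0), (10, 0)]
print(assign_with_min_size(points, centroids, min_size=1))
print(assign_with_min_size(points, centroids, min_size=2))
```

In this toy data set, a minimum size of 1 reproduces the nearest-centroid assignment, while a minimum size of 2 forces the point at (3, 0) into the far cluster because that is the cheapest way to satisfy the constraint. A full constrained K-means would alternate this assignment step with the usual centroid update.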
The performance of our implementation of the constrained K-means algorithm still needs to be improved, so we have not yet exposed the feature to users. The feature will be released in one of the minor releases of the 3.28 release cycle.