November 12th, 2018

New features in H2O 3.22

Category: H2O Release

Xia Release (H2O 3.22)

There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we introduce Isolation Forest to our portfolio of machine learning algorithms and integrate the XGBoost algorithm into our AutoML framework. The release is named after Zhihong Xia.

Isolation Forest

Isolation Forest is an unsupervised machine learning algorithm used for anomaly detection. Anomaly detection is applicable to a variety of use cases, such as fraud detection and intrusion detection. The Isolation Forest algorithm differs from other methods typically used for anomaly detection: it directly identifies the exceptional observations instead of learning the pattern of the normal observations (as is done in the H2O deep learning based autoencoder). The H2O implementation of Isolation Forest is based on the Distributed Random Forest algorithm, so it is capable of analyzing large datasets in multi-node clusters. Note that Isolation Forest is currently in a Beta state. Additional enhancements and improvements will be made in future releases. A blog post is available for more information.
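To make the idea of "directly identifying the exceptional observations" concrete, here is a toy, pure-Python sketch of the isolation principle on one-dimensional data. It is not the H2O implementation; the function names, the sample data, and the `max_depth` cap are all assumptions made for illustration. The point it tries to show is that an unusual value tends to be separated from the rest after fewer random splits than a typical value.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        # Only identical values remain, so no further split is possible.
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [x for x in data if x < split]
    else:
        side = [x for x in data if x >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)


def mean_isolation_depth(point, data, n_trees=200, seed=42):
    """Average the isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical data: eight values near 10 and one obvious outlier.
data = [9.8, 9.9, 10.0, 10.1, 10.2, 10.0, 9.95, 10.05, 50.0]
outlier_depth = mean_isolation_depth(50.0, data)
inlier_depth = mean_isolation_depth(10.0, data)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

A shorter average depth suggests a more anomalous observation. A real Isolation Forest works on many features and subsamples the data for each tree, which this sketch leaves out.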

Inspection of Tree-based models

During the development of H2O-3 version 3.21.x, an API for tree inspection was introduced for both the Python and R clients. With the Tree API, it is possible to download, traverse and inspect individual trees inside tree-based algorithms. In this release, this API can be used to fetch any tree from any tree-based model (Gradient Boosting Machines, Distributed Random Forest, XGBoost and Isolation Forest). For more details, please see our latest documentation for Python and for R. There is also a blog post available.
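As a rough picture of what traversing and inspecting a single decision tree involves, the following sketch walks a small hand-built tree. It deliberately does not use the H2O Tree API: the `Node` class, its fields, and the helper functions are hypothetical stand-ins, so please consult the Python and R documentation for the actual classes and attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node: either a split or a leaf with a prediction."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None

    @property
    def is_leaf(self):
        return self.left is None and self.right is None


def describe(node, depth=0, lines=None):
    """Walk the tree depth-first and collect one text line per node."""
    lines = [] if lines is None else lines
    indent = "  " * depth
    if node.is_leaf:
        lines.append(f"{indent}predict {node.prediction}")
    else:
        lines.append(f"{indent}{node.feature} < {node.threshold}?")
        describe(node.left, depth + 1, lines)
        describe(node.right, depth + 1, lines)
    return lines


def tree_depth(node):
    """Number of splits on the longest path from this node to a leaf."""
    if node.is_leaf:
        return 0
    return 1 + max(tree_depth(node.left), tree_depth(node.right))


# Hand-built example tree with made-up features and predictions.
tree = Node("age", 30.0,
            left=Node(prediction=0.1),
            right=Node("income", 50000.0,
                       left=Node(prediction=0.4),
                       right=Node(prediction=0.9)))
summary = describe(tree)
print("\n".join(summary))
```

The same kind of walk (start at the root, follow child nodes, read split features and leaf predictions) is what the Tree API makes possible on trees fetched from a trained model.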

XGBoost in AutoML

Our AutoML framework now includes the XGBoost algorithm, one of the most popular and powerful machine learning algorithms. H2O users have been able to leverage the power of XGBoost for quite some time. In the 3.22 release, however, we focused on further performance and stability improvements to our XGBoost implementation. Thanks to these improvements, we were able to include XGBoost in the fully automated setting of AutoML. XGBoost models built during the AutoML process will also be included in the final Stacked Ensemble models. Because XGBoost models are typically among the top performers on the AutoML Leaderboard, and because Stacked Ensemble models benefit from the added diversity of models, users can expect the final performance of H2O AutoML to improve on many datasets.

Target Encoding

Feature engineering in H2O has been enhanced with the ability to encode categorical variables using the mean of a target variable. It can be performed in two easy steps. The first step is to create a target-encoding map. Because mean encoding is prone to overfitting, several ways of reducing that risk are included. The second step is to apply the target-encoding map created in the first step. New columns with target-encoding values are then added to the data. Previously, target encoding was available only in R, but in 3.22 it is available in Java and Python as well. For details, please see the documentation.
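The two steps can be illustrated with a small pure-Python sketch. This is not the H2O target encoding API; the function names, the sample data, and the `prior_weight` blending parameter are assumptions chosen for illustration. Blending each category mean toward the global mean is shown here as one common way to limit overfitting on rare categories, and it may differ from the options H2O provides.

```python
from collections import defaultdict


def create_encoding_map(categories, targets):
    """Step 1: per-category [sum of target, count], plus the global mean."""
    stats = defaultdict(lambda: [0.0, 0])
    for cat, y in zip(categories, targets):
        stats[cat][0] += y
        stats[cat][1] += 1
    global_mean = sum(targets) / len(targets)
    return dict(stats), global_mean


def apply_encoding_map(categories, encoding_map, global_mean, prior_weight=2.0):
    """Step 2: replace each category with a blended target mean.

    Categories with few rows are pulled toward the global mean; unseen
    categories receive the global mean itself.
    """
    encoded = []
    for cat in categories:
        total, count = encoding_map.get(cat, (0.0, 0))
        encoded.append((total + prior_weight * global_mean) / (count + prior_weight))
    return encoded


# Hypothetical data: a categorical column and a binary target.
cities = ["A", "A", "A", "B", "B", "C"]
bought = [1, 1, 0, 0, 0, 1]

enc_map, prior = create_encoding_map(cities, bought)
# "D" never appeared when the map was built, so it falls back to the prior.
city_te = apply_encoding_map(cities + ["D"], enc_map, prior)
print(city_te)
```

The resulting list plays the role of the new target-encoded column added to the data in the second step.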

Additional Highlights

Below is a list of some of the highlights from the 3.22 release. As usual, you can see a list of all the items that went into this release in the Changes.md file in the h2o-3 GitHub repository.

New Features:

  • [PUBDEV-5170] – Individual predictions of GBM trees are now exposed in the MOJO API.
  • [PUBDEV-5775] – It is now possible to combine two models into one MOJO, with the second model using the prediction from the first model as a feature. These models can be from any algorithm or combination of algorithms except Word2Vec.
  • [PUBDEV-5988] – Users can now specify a `-features` parameter when starting h2o from the command line. This allows users to remove experimental or beta algorithms when starting H2O-3. Available options for this parameter include `beta`, `stable`, and `experimental`.
  • [PUBDEV-5695] – Created an R demo for CoxPH, available here.

Bugs:

  • [PUBDEV-5746] – Improved efficiency of the `keep_cross_validation_models` parameter in AutoML
  • [PUBDEV-5903] – In AutoML, Stacked Ensemble models are now always trained, even if the `max_runtime_secs` limit has been reached.
  • [PUBDEV-5998] – Exposed H2OXGBoost parameters used to train a model to the Python API. Previously, this information was visible in the Java backend but was not passed back to the Python API.
  • [PUBDEV-6005] – When running AutoML in Flow, updated the list of algorithms that can be selected in the “Exclude These Algorithms” section.

Docs:

  • [PUBDEV-4505] – Added Scala and Java examples to the Building and Extracting a MOJO topic.
  • [PUBDEV-4590] – Added a Scala example to the Stacked Ensembles topic.
  • [PUBDEV-5756] – Added Python examples to the Cross-Validation topic in the User Guide.
  • [PUBDEV-5982] – Added documentation for Isolation Forest (beta).

Download our latest release: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

About the Authors

Erin LeDell

Erin is the Chief Machine Learning Scientist at H2O.ai. Erin has a Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on automatic machine learning, ensemble machine learning and statistical computing. She also holds a B.S. and M.A. in Mathematics. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE Digital in 2016) and Marvin Mobile Security (acquired by Veracode in 2012), and the founder of DataScientific, Inc.

Michal Kurka

Michal is a software engineer with a passion for crafting code in Java and other JVM languages. He started his professional career as a J2EE developer and spent his time building all sorts of web and desktop applications. Four years ago he truly found himself when he entered the world of big data processing and Hadoop. Since then, he has enjoyed working with distributed platforms and implementing scalable applications on top of them. He holds a Master of Computer Science from Charles University in Prague. His field of study was Discrete Models and Algorithms with a focus on Optimization.

Pavel Pscheidl

Pavel is a machine learning engineer at H2O. Holding a master's degree in Applied Informatics, his main focus during his studies was applied statistics & stochastic methods, agent-based simulations and optimization. He joined a research team as a Ph.D. candidate while working on various problems like the effectiveness of fraud detection methods in highly-distributed systems. Due to his roots in computer science, his commercial focus was on enterprise Java systems and related standards. He also wrote a book in this field. In 2017, Pavel joined H2O's awesome team, abandoning all other activities, including research at the university. At H2O, he is proud of being able to leverage his passion for algorithms and optimization while diving deeper into statistics every single day.

Angela Bartz

Angela is the doc whisperer at H2O. She began writing when she was in her single digits, though these early documents are either heavily redacted or remain confidential. Since graduating from the University of Detroit Mercy with a B.A. in English, she has worked as a technical writer in a variety of industries. Her affinity for machine learning software grew from her two-year tenure reigning over all documentation at Skytree. Angela loves learning new software, new tools, and new approaches for delivering documentation. When she’s not working, Angela likes to spend time with her husband (with whom she has fierce sporting rivalries) and their old dog (who has to put up with said rivalries). She also enjoys trying out new restaurants and tasting craft beers, especially those made by her friends.
