December 17th, 2020

New Improvements in H2O 3.32.0.2

Category: H2O Release, XGBoost

There is a new minor release of H2O that introduces two useful improvements to our XGBoost integration: interaction constraints and feature interactions.

Interaction Constraints

Feature interaction constraints allow users to decide which variables are allowed to interact and which are not.

Potential benefits:

  • Better predictive performance from focusing on interactions that work – whether through domain-specific knowledge or algorithms that rank interactions
  • Less noise in predictions; better generalization
  • More control for the user over what the model can fit. For example, regulatory constraints may require the user to exclude some interactions even if they perform well

(Source: https://xgboost.readthedocs.io/en/latest/tutorials/feature_interaction_constraint.html)

The H2O documentation is available here.
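To make the idea concrete, here is a minimal pure-Python sketch of what a constraint list expresses. The helper `path_is_allowed` is hypothetical and is not part of H2O or XGBoost; it assumes the simplified rule that all features used along one tree path must fit inside a single constraint group, and that a feature is always allowed to interact with itself.

```python
def path_is_allowed(path_features, constraints):
    """Return True if the features split on along one tree path may interact.

    Hypothetical helper for illustration only (not an H2O or XGBoost API).
    Assumption: a path is permitted when every feature on it belongs to one
    common constraint group; a single feature on its own is always permitted.
    """
    used = set(path_features)
    if len(used) <= 1:
        return True
    return any(used <= set(group) for group in constraints)


# the same groups as in the example later in this post
constraints = [["CAPSULE", "AGE"], ["PSA", "DPROS"]]

print(path_is_allowed(["CAPSULE", "AGE", "CAPSULE"], constraints))  # same group
print(path_is_allowed(["AGE", "PSA"], constraints))                 # different groups
print(path_is_allowed(["PSA", "PSA"], constraints))                 # feature with itself
```

Under these assumptions, a tree may split on CAPSULE and then on AGE within one branch, but it may not combine AGE with PSA in the same branch.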

XGBFI-like Tool for Revealing Feature Interactions

We have implemented XGBFI-style rankings of features and feature interactions by various measures. With this tool, H2O provides insight into higher-order interactions between features in trees in a user-friendly manner. Additionally, leaf statistics and split value histograms are provided. The measures used are Gain, Cover, and Frequency, or their averaged, weighted, or ranked alternatives:

Gain is the relative contribution of a feature to the model, calculated from the feature’s contribution in each tree of the model. A feature with a higher gain than another feature is more important for generating a prediction.

Cover measures the number of observations affected by a split. Summed over a specific feature, it gives the relative number of observations concerned by that feature.

Frequency (FScore) is the number of times a feature is used across all generated trees. Note that it takes into account neither the tree depth nor the tree index of the splits in which a feature occurs, nor the number of possible splits of a feature. Hence, it is often a suboptimal measure of importance.
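The three measures can be illustrated on a handful of made-up splits. The sketch below is an assumption-laden toy, not H2O's implementation: the `splits` records, their gain and cover values, and the helper `summarize` are all hypothetical, and the relative values are simply each feature's share of the total.

```python
from collections import defaultdict

# hypothetical split records: (feature, gain of the split, observations covered)
splits = [
    ("AGE", 4.0, 100),
    ("PSA", 2.0, 60),
    ("AGE", 2.0, 40),
]


def summarize(splits):
    """Aggregate toy gain, cover and frequency per feature."""
    gain = defaultdict(float)
    cover = defaultdict(float)
    frequency = defaultdict(int)
    for feature, split_gain, split_cover in splits:
        gain[feature] += split_gain      # total gain contributed by the feature
        cover[feature] += split_cover    # observations affected by its splits
        frequency[feature] += 1          # number of times the feature is used
    total_gain = sum(gain.values())
    total_cover = sum(cover.values())
    return {
        feature: {
            "gain": gain[feature] / total_gain,
            "cover": cover[feature] / total_cover,
            "frequency": frequency[feature],
        }
        for feature in gain
    }


summary = summarize(splits)
for feature, measures in summary.items():
    print(feature, measures)
```

In this toy data AGE is used twice and PSA once, yet frequency alone says nothing about how much each split improved the model, which is why gain and cover are reported alongside it.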

The H2O documentation is available here.

Example

The Jupyter notebook demo with all the example code presented below is available here.

Train an H2OXGBoostEstimator with the interaction_constraints parameter:

# start h2o
import h2o
h2o.init()
 
from h2o.estimators.xgboost import H2OXGBoostEstimator
# check if the H2O XGBoostEstimator is available
assert H2OXGBoostEstimator.available() is True
 
# import data
data = h2o.import_file(path = "../../smalldata/logreg/prostate.csv")
 
x = list(range(1, data.ncol-2))
y = data.names[len(data.names) - 1]
 
ntree = 5
 
h2o_params = {
    'eta': 0.3,
    'max_depth': 3, 
    'ntrees': ntree,
    'tree_method': 'hist'
}
 
# define interactions as a list of lists of column names
# each inner list defines a group of columns that are allowed to interact
# the interaction of each column with itself is always allowed,
# so you cannot specify a list with a single column, e.g. ["PSA"]
h2o_params["interaction_constraints"] = [["CAPSULE", "AGE"], ["PSA", "DPROS"]]
 
# train h2o XGBoost model
h2o_model = H2OXGBoostEstimator(**h2o_params)
h2o_model.train(x=x, y=y, training_frame=data)

Display feature interactions:

# calculate multi-level feature interactions
h2o_model.feature_interaction()
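To show what a multi-level interaction is, the following pure-Python sketch counts features and feature combinations along the paths of two toy trees. It is only an illustration under stated assumptions: the nested-dict tree format, the function `count_interactions`, and the `"A|B"` naming of interactions are hypothetical choices and do not describe how H2O computes its tables.

```python
from collections import Counter


def count_interactions(node, ancestors=(), max_interaction_depth=2, counts=None):
    """Count single features and interactions along the paths of a toy tree.

    Hypothetical helper: a node is a dict with a "feature" key and optional
    "left"/"right" children; None stands for a leaf.
    """
    if counts is None:
        counts = Counter()
    if node is None:  # leaf: nothing to count
        return counts
    chain = ancestors + (node["feature"],)
    # every sub-path that ends at this node is one feature or one interaction
    for start in range(len(chain)):
        sub_path = chain[start:]
        if len(sub_path) - 1 <= max_interaction_depth:
            counts["|".join(sorted(sub_path))] += 1
    for child in (node.get("left"), node.get("right")):
        count_interactions(child, chain, max_interaction_depth, counts)
    return counts


# two toy trees that respect [["CAPSULE", "AGE"], ["PSA", "DPROS"]]
tree_1 = {"feature": "PSA", "left": {"feature": "DPROS"}, "right": None}
tree_2 = {"feature": "CAPSULE",
          "left": {"feature": "AGE"},
          "right": {"feature": "AGE"}}

total = Counter()
for tree in (tree_1, tree_2):
    count_interactions(tree, counts=total)

for name, count in sorted(total.items()):
    print(name, count)
```

Because both toy trees obey the constraints, combinations such as AGE with PSA never appear in the counts.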

Credits

This new H2O release is brought to you by Veronika Maurerova, Zuzana Olajcova, and Hannah Tillman.

How to Get Started?

Download H2O-3 from here and follow the steps in this example notebook. You can also check out our training center for both self-paced tutorials and instructor-led courses.

About the Author

Veronika Maurerova

Veronika is a Software Engineer. She likes everything about Machine Learning and Artificial Intelligence. She finished her master's studies at the Czech Technical University in Prague in 2017. For her master's thesis, she cooperated with the Police of the Czech Republic. The goal was to prepare and analyze Czech crime data and build a predictive model. During her studies at CTU, she had a part-time job at the software company Ataccama as a Java Software Engineer. After she finished her studies, she worked as a Machine Learning Engineer at the Czech startup SEQENGI for nearly a year. In her spare time, she plays frisbee, travels, hikes, plays the ukulele, learns how to cook or bake something new, enjoys gardening, and much more.
