July 9th, 2013

The MillionSongs Data Part 1: Bells and Whistles of GLM in H2O


Using the Million Songs Data Set, I want to go from beginning to end through H2O's GLM tool. Note that the original data are large, so downloading and fiddling with the full data set can be quite painful if you just do it from your desktop; that said, you can find it here. It’s a good opportunity to take a really detailed look at H2O so that you can get the most bump from the trunk (so to speak).

To start, let’s assume you’ve decided that GLM is the method for you. You’ve launched H2O, parsed your data and chosen GLM from the drop down menu under “Model”.
Destination Key – this is an automatically generated key for your model; it will allow you to recall this specific model and all of its details later in your analysis. While H2O will spit a key out for you, you can also specify a model name so that later you can identify which of many models you are interested in revisiting.
Key – this is the .hex key generated when you parsed your data into H2O. If you didn’t save it at the time, it’s no biggie. The .hex key has the same name as your original data file, except for the change in extension. If you begin typing the name of your original file, you will be given the option to auto-complete with tab. If you want to find the key yourself, go to the drop-down menu “Admin”, select “Jobs”, and find “Parse” under description. The key for your data of interest is given in the “Destination key” field, and is a clickable link that allows you to inspect your data.
Y – Your dependent variable.
X – Once you identify your dependent variable (the value you would like to predict) in the Y field, the X field will auto-populate with all possible options (all of your other variables). Select the subset of variables that you would like to use as predictors.
Family – Under family you will see a drop down menu with choices. Each of the four options differs in the assumptions you make about your dependent (Y) variable – the variable you would like to predict. They are explained in some detail below.
Link – Each family is associated with a default link function, which defines how the expected value of Y is related to the linear combination of the X variables chosen to predict it.

Family | Default Link | Description and Example
Gaussian | Identity | Your dependent variables (Y) are quantitative, continuous (or continuous predicted values can be meaningfully interpreted), and expected to be normally distributed. Ex: The average length of a song in seconds, or the average purchase price of a product.
Binomial | Logit | Your dependent variables take on two values, traditionally coded as 0 and 1, and follow a binomial distribution. Choose this if you have a categorical Y with two possible outcomes. Ex: Customer decides to purchase or not; a song is played or not played.
Poisson | Log | Your dependent variable is a count – a quantitative, discrete value that expresses the number of times some event occurred. Ex: The number of customers visiting a website over time, the number of customers visiting a store over distance.
Gamma | Inverse | Your dependent variable is a survival measure – that is, you have some measure of the duration of a process for which the outcome is variable. Ex: The length of time an individual remains a customer, the length of time before a particular product feature fails.
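
To make the family and link pairing above concrete, here is a minimal sketch in plain Python of the four default links and how each one turns a linear combination of X values into a predicted mean of Y. It is an illustration only, not H2O's implementation; the names LINKS, DEFAULT_LINK, and predict_mean are hypothetical and exist only for this example.

```python
import math

# Each entry is (link, inverse_link). The link maps a mean of Y onto the scale
# of the linear predictor; the inverse link maps a linear predictor to a mean.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

# Default pairings taken from the table above.
DEFAULT_LINK = {
    "gaussian": "identity",
    "binomial": "logit",
    "poisson": "log",
    "gamma": "inverse",
}


def predict_mean(family, coefficients, intercept, x):
    """Predicted mean of Y for one row x, using the family's default link."""
    # Linear predictor: intercept plus the weighted sum of the X values.
    eta = intercept + sum(b * v for b, v in zip(coefficients, x))
    _, inverse_link = LINKS[DEFAULT_LINK[family]]
    return inverse_link(eta)
```

For instance, with the binomial family a linear predictor of 0 corresponds to a predicted probability of 0.5, while with the Poisson family it corresponds to a predicted count of 1.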

Lambda: H2O provides a default value, but this can also be user defined. Lambda is a regularization parameter that is designed to prevent overfitting. The best value of lambda depends on how closely you want the coefficients from the cross-validated models to agree with one another: a larger lambda shrinks the coefficients more strongly and reduces their variance from one model to the next.
Alpha: A user-defined regularization tuning parameter that H2O sets to 0.5 by default, but which can take any value between 0 and 1, inclusive. The penalty is taken against the estimated fit of the model and grows with the size of the coefficients; alpha sets how that penalty is split between its two components. An alpha of 1 is the lasso penalty, and an alpha of 0 is the ridge penalty.
Lambda and alpha are distinct in purpose: lambda sets how strong the penalty is, and with it how strongly the model is protected against overfitting, whereas alpha sets what kind of penalty is applied. Values of alpha closer to 1 tend to drop some coefficients from the model entirely, while values closer to 0 shrink all coefficients without removing them.
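
As a rough illustration of how lambda and alpha interact, the sketch below computes the elastic net penalty in the form commonly used by GLM software (lambda times a mix of the L1 and L2 norms of the coefficients). The exact scaling H2O uses is an assumption here, and the function name elastic_net_penalty is hypothetical.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Penalty added to the model's loss for a given lambda and alpha.

    Assumed form: lam * (alpha * L1 + (1 - alpha) / 2 * L2 squared).
    alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
    """
    l1 = sum(abs(b) for b in coefficients)      # lasso component
    l2_squared = sum(b * b for b in coefficients)  # ridge component
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)
```

Under this form, doubling lambda doubles the penalty whatever alpha is, while changing alpha only changes how the penalty is divided between the lasso and ridge terms.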
N-Folds: The number of cross-validation models you would like H2O to generate. Choosing 10 means that your original data are split at random into ten subsets, and ten additional models are fit, each one trained with one subset held out and then evaluated on that held-out subset. It’s important to note that the smaller your original data are, the larger the variation you can expect to see in the parameter estimates across the cross-validation models; for sufficiently small data sets you may want to choose a different evaluative criterion.
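
For readers who have not seen cross-validation splits written out, here is a minimal sketch of a textbook n-fold split in plain Python. How H2O actually assigns rows to folds is not described in this post, so treat the shuffling and the function name k_fold_indices as assumptions made for illustration.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Return a list of (train_rows, held_out_rows) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # random assignment of rows to folds

    # Deal the shuffled rows into n_folds groups of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        splits.append((train, held_out))
    return splits
```

With 25 rows and 10 folds, each held-out group has only two or three rows, which is one way to see why the estimates from small data sets vary so much from fold to fold.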
Expert Settings: For the moment I would like to leave expert settings aside, except to note that this is where you choose to standardize your data. In data where the input variables differ substantially in scale, standardizing can greatly improve the interpretability of your results.
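
To show what standardizing does to inputs on very different scales, here is a minimal sketch that centers each column at zero and divides by its standard deviation. Whether H2O uses the population or the sample standard deviation is not stated in this post, so the use of the population form is an assumption, as is the function name standardize_columns.

```python
import statistics


def standardize_columns(rows):
    """Rescale each column of a list of numeric rows to mean 0 and SD 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]  # population SD (assumed)
    return [
        # A constant column has SD 0; map it to 0.0 rather than divide by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]
```

After this step, a column measured in seconds and a column measured in thousands of plays end up on the same scale, so the sizes of their coefficients can be compared directly.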
