In the last couple of weeks we’ve had two meetups on GBM (gradient boosted classification and regression), and hence a lot of excitement about running the algorithm as presented by Cliff, Earl and Dr. Hastie. You can find the hella cool videos of both presentations here: http://www.youtube.com/0xdata
One of my favorite articles on GBM is a great case study from ecology, Elith, Leathwick & Hastie (2008). You can find the original article here: http://onlinelibrary.wiley.com/store/10.1111/j.1365-2656.2008.01390.x/asset/j.1365-2656.2008.01390.x.pdf;jsessionid=5B5FE919D24D8C3EA12FCB74BF352C62.f04t04?v=1&t=hn3iw9wm&s=29c201e8d1d94504ec9e07dcb12bfb2cb539fe7e
The authors kindly made their data and process in R publicly available, so you can get the data and try the model for yourself.
Here is the final model presented – carried out in H2O. Note that the data were originally split into training and testing sets (called model and eval data respectively in their available download).
The model was originally specified on 14 variables and 1000 observations. The dependent variable is found in column 2, named “Angaus”, and about 80% of the values in that column are 0. In the original paper the family was specified as Bernoulli, with a tree complexity of 5 and a learning rate of 0.01.
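To get a feel for what "about 80% zeros" means for a Bernoulli model, here is a minimal sketch in plain Python. It is not the authors' R code and not H2O's implementation. It computes the constant log-odds that boosting implementations with a Bernoulli loss commonly start from, plus the mean deviance of that constant model, which is the baseline the trees have to beat. The `labels` list is a made-up stand-in with the same size and class balance as the data described above, and the function name is my own.

```python
import math


def bernoulli_baseline(labels):
    """Return (log_odds, mean_deviance) for a constant-probability model.

    labels: a sequence of 0/1 outcomes.
    """
    n = len(labels)
    p = sum(labels) / n  # observed share of 1s
    if p <= 0.0 or p >= 1.0:
        raise ValueError("need both classes present to compute log-odds")
    log_odds = math.log(p / (1.0 - p))
    # Mean Bernoulli deviance: -2 * average log-likelihood at constant p.
    log_lik = sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y in labels)
    mean_deviance = -2.0 * log_lik / n
    return log_odds, mean_deviance


# Hypothetical stand-in for the response column: 1000 rows, 80% zeros.
labels = [0] * 800 + [1] * 200
log_odds, mean_deviance = bernoulli_baseline(labels)
print(f"starting log-odds: {log_odds:.4f}")
print(f"baseline mean deviance: {mean_deviance:.4f}")
```

With a learning rate of 0.01, each tree only nudges that starting value by a small fraction of its fitted correction, which is why models like this one need hundreds of trees.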
We recreated the original model in H2O. The specification is depicted below, as well as the output. Note that the X variable field is opt-out: you list the variables you want to exclude rather than the ones you want to include. Both the training and testing data sets are set on the model specification page, so your model output is automatically applied to the testing data if you specify it (a feature I’m pretty fond of). Also notice that the model is specified as a classification because the dependent variable is binary.
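If the opt-out idea is unfamiliar, this little sketch shows the logic in plain Python: start from every column, drop the response and anything you have asked to ignore, and whatever is left is a predictor. The helper name and every column name other than "Angaus" are placeholders I made up for illustration. They are not the real headers and this is not an H2O call.

```python
def predictors_from_ignored(columns, response, ignored):
    """Return the predictor columns under an opt-out specification.

    Every column is a predictor unless it is the response or is listed
    in `ignored`. Column order is preserved.
    """
    unknown = set(ignored) - set(columns)
    if unknown:
        raise ValueError(f"ignored columns not in data: {sorted(unknown)}")
    excluded = set(ignored) | {response}
    return [c for c in columns if c not in excluded]


# Hypothetical header: an id column, the response in column 2, then predictors.
columns = ["row_id", "Angaus", "x1", "x2", "x3"]
predictors = predictors_from_ignored(columns, response="Angaus", ignored=["row_id"])
print(predictors)
```

The upside of opting out is that with a wide data set you only have to name the handful of columns you do not want.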
And here are the results. I only requested 650 trees, which is in keeping with the model given in the paper, but it’s pretty trivial to request over 1000. I did it earlier with a 20 GB heap and it took about as long as making a cup of coffee.