By: Michal Kurka
There’s a new major release of H2O, and it’s packed with new features and fixes! Among the big new features in this release, we’ve introduced the ability to define a Custom Loss Function in our GBM implementation, and we’ve extended the portfolio of our machine learning algorithms with the implementation of the SVM algorithm. The release is named after Shing-Tung Yau.
Custom Loss Function for GBM
In H2O, the GBM implementation of distributions and loss functions are tightly coupled. In the algorithm, a loss function is specified using the distribution parameter, and when specifying the distribution, the loss function is automatically selected as well.
Usually, the loss function is symmetric around a value. In some prediction problems, though, asymmetry in the loss function can improve results, for example, when we know that some kind of error is more undesirable than others (usually negative vs. positive error). We can use weights to cover some of these cases, but that is not suitable for more complicated calculations.
In H2O, you can upload a custom distribution and customize the loss function calculation. It is also possible to customize the whole distribution (select the link function, implement the calculation of initializing values, and change the gradient and gamma calculation) and to personalize the predefined distribution (Gaussian distribution for regression, Bernoulli distribution for binomial classification, and Multinomial distribution for multinomial classification) to customize loss function related functions only.
Support-vector machine is an established supervised machine learning algorithm implemented by a number of software tools and libraries (one of the most popular being libsvm). However, the SVM algorithm is challenging to implement in a scalable way so that it could be used on larger datasets and leverage the power of distributed systems. This version of H2O brings our users an implementation of the Parallel SVM (PSVM) algorithm that was designed in order to overcome the scalability issues of SVM and to address the bottlenecks of the traditional implementations. The main idea of the PSVM algorithm is to use a pre-processing step and reduce the amount of memory needed to hold the data by performing a row-based approximate matrix factorization. The PSVM algorithm makes a great fit for our map-reduce framework that internally H2O algorithms.
Our PSVM implementation can currently be used to solve binary classification problems using Radial Basis Function kernel. This complements the linear kernel SVM that H2O users already have available in Sparkling Water. The design of PSVM allows you to relatively easily include new kernel functions, and our users can thus expect to see support for different types of kernels as well as other new features in the coming versions of H2O.
For more details please refer to our documentation.
With this release, you can now generate 2D partial dependency plots with TWO variables instead of just one! The only thing you need to do to enable this is to add a new parameter, col_pairs_2dpdp, and fill it up with a list of variable pairs that you want to plot 2D partial plots with. For example, assume we have a prostate.csv dataset that we used to build a GBM model (called gbm_model), and we are interested in 1D partial plots for columns AGE, RACE, and DCAPS and 2D partial plots for variable pairs AGE, PSA and AGE, RACE. We can generate the 1D partial plots and 2D partial plots for gbm_model in Python using:
gbm_model.partial_plot(data = data, cols = [