Scaling R with H2O

By H2O.ai Team | June 10, 2015

With the advent of H2O 3.0, it seems like a good time to reintroduce the R API for H2O and help users better understand the differences between R data frames and H2OFrames. Some of the first questions we typically get include:

  • Does H2O support all R packages and functions?
  • Is H2OFrame an extension of data.frame?
  • Are H2O supported algorithms written on top of preexisting packages in R like glmnet?

Reading in Data

The S4 object H2OFrame is a tabular representation of data that has been imported and parsed into H2O’s Distributed Key-Value Store (DKV). The object holds information about where the H2O cluster is (conn), what the frame is called on the cluster (frame_id), and the first 10 rows, which are used when the frame is printed. When you use an H2OFrame object in a supported function, R simply acts as a glue language: the R code you write really makes a call to the Java backend, where the expression(s) are computed.
For example, the user can execute an h2o.importFile command against a remote H2O cluster and specify the path to import data from. This command is sent over H2O’s REST API to the corresponding endpoint on the Java side. Once the import is complete, H2O returns a JSON response that is summarized into the components described earlier. The most important difference, then, is that an R data frame sits in memory in R, while an H2OFrame is just a reference to an object in the DKV.
[Figures: parse steps 1 to 3]
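To make the "reference versus in-memory copy" distinction concrete, here is a minimal sketch in plain R. It does not use H2O at all: the environment toy_store stands in for a remote key-value store, and the class ToyFrameRef and the helpers toy_import and toy_fetch are hypothetical names invented for this illustration, not part of the h2o package.

```r
library(methods)

# Toy stand-in for a remote key-value store (NOT H2O's DKV).
toy_store <- new.env()

# Hypothetical handle class: it holds only a key, never the data itself.
setClass("ToyFrameRef", slots = c(frame_id = "character"))

# "Import": place the data in the store and hand back a small reference.
toy_import <- function(data, frame_id) {
  assign(frame_id, data, envir = toy_store)
  new("ToyFrameRef", frame_id = frame_id)
}

# "Fetch": follow the reference back to the stored data.
toy_fetch <- function(ref) get(ref@frame_id, envir = toy_store)

local_df <- data.frame(x = runif(1e5), y = runif(1e5))
ref <- toy_import(local_df, "toy.hex")

print(object.size(local_df))  # the full data, held in R's memory
print(object.size(ref))       # only the handle
print(nrow(toy_fetch(ref)))   # the data is still reachable through the key
```

The handle stays small no matter how large the stored table is, which is the property the H2OFrame description above relies on.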

Summarizing New Frame

Similarly, once the user has a data frame and wants to summarize it, they can execute the h2o.summary or summary function. The command makes a call to H2O to execute the MR task, which returns a JSON response that is parsed into a table object in R. The input and actual execution of H2O’s summary function differ from those of the base summary function, but the output is still a table object, just like the one the generic summary returns.
[Figures: summary steps 1 and 2]

Supported R functions

For a list of all the functions you can apply to an H2OFrame, bring up the package documentation in the R console by executing ?h2o, which lists all supported functions.
In short, H2O functions are prefixed with h2o so that the user understands that, despite the similarities between H2O’s syntax and base R’s syntax, they are essentially different functions.
All of H2O’s algorithms are executed in memory as Java tasks, so no work is done in R. When you call glmnet, glm, or h2o.glm, you are accessing different implementations of GLM. There are, however, functions that have been overloaded as methods for H2OFrames, such as h2o.summary, which is also accessible as summary. Unary and binary operations on the frame do not carry the h2o prefix either, and you can access a list of the supported operators by executing ?H2OFrame-class in the R console.
Below are some examples of parity between base R functions and H2O-specific R functions. The package was written so that the most frequently used R operations and expressions have an equivalent that can be sent to the Java side to be computed.
[Figure: parity between base R functions and H2O-specific R functions]
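The overloading described in this section can be sketched with R's own S4 machinery. The example below is a toy, not H2O's implementation: the class ToyExpr is a hypothetical name, and instead of calling a backend it simply records the expression it would send. It assumes nothing beyond the methods package that ships with R.

```r
library(methods)

# Hypothetical class that records an expression instead of evaluating it.
setClass("ToyExpr", slots = c(expr = "character"))

# Overload the arithmetic operators (+, -, *, /, ^, %%, %/%) in one go
# through the Arith group generic; .Generic names the operator in use.
setMethod("Arith", signature("ToyExpr", "numeric"), function(e1, e2) {
  new("ToyExpr", expr = paste0("(", e1@expr, " ", .Generic, " ", e2, ")"))
})
setMethod("Arith", signature("ToyExpr", "ToyExpr"), function(e1, e2) {
  new("ToyExpr", expr = paste0("(", e1@expr, " ", .Generic, " ", e2@expr, ")"))
})

# Overload summary so the familiar, unprefixed name works on the new class.
setMethod("summary", "ToyExpr", function(object, ...) {
  cat("would ask the backend to summarize:", object@expr, "\n")
  invisible(object@expr)
})

arr  <- new("ToyExpr", expr = "ArrTime")
mins <- (arr %/% 100) * 60 + arr %% 100

print(mins@expr)  # "(((ArrTime %/% 100) * 60) + (ArrTime %% 100))"
summary(mins)
```

The user writes ordinary R arithmetic, yet nothing is computed locally. That is the sense in which unprefixed operators and generics can dispatch to a different implementation depending on the class of their argument.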

Examples of Transformation of Data Frame

For an example of how you might perform some simple transformations on your H2OFrame, please download a small airlines dataset and run through the following example.
To start, import the data and run the summary function.

## Load library and initialize h2o
library(h2o)
conn <- h2o.init(nthreads = -1)
pathToAirlines <- normalizePath("~/Downloads/allyears2k.csv")
airlines.hex <- h2o.importFile(conn = conn, path = pathToAirlines, destination_frame = "airlines.hex")
## Summary stats, histogram plots
summary(airlines.hex)

Next we want to create a feature indicating how long each trip took. To calculate the trip duration, take the difference between ArrTime and DepTime, both of which were parsed as numerics. For each of them, we extract the hour and minute and convert them to the total time (in minutes) elapsed since 12:00 AM. Finally, take the difference between arrival and departure time and append it to the airlines frame. The beauty of the h2o package is how easy it is to translate R code into H2O+R code. The following feature-creation commands are exactly what you would run if you had an R data.frame that was read in using read.csv.

## Create trip_duration feature
hour1 <- airlines.hex$ArrTime %/% 100
mins1 <- airlines.hex$ArrTime %% 100
arrTime <- hour1*60+mins1
hour2 <- airlines.hex$DepTime %/% 100
mins2 <- airlines.hex$DepTime %% 100
depTime <- hour2*60+mins2
## Take the difference between the two times and assign it to a new feature in frame
airlines.hex$trip_duration <- arrTime - depTime
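To check the arithmetic by hand, the same steps can be run on an ordinary R data.frame without starting H2O. The three rows below are made-up sample values, and to_minutes is a hypothetical helper name; the sketch assumes the times are stored as HHMM numbers and takes arrival minus departure.

```r
# Made-up three-row sample; times are HHMM numbers (930 means 9:30 AM).
flights <- data.frame(DepTime = c(930, 1455, 2350),
                      ArrTime = c(1045, 1720, 115))

# Convert HHMM to minutes elapsed since midnight.
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

depTime <- to_minutes(flights$DepTime)  # 570 895 1430
arrTime <- to_minutes(flights$ArrTime)  # 645 1040 75

# Arrival minus departure gives the duration in minutes.
flights$trip_duration <- arrTime - depTime
print(flights$trip_duration)            # 75 145 -1355

# One possible fix for flights that land after midnight: wrap around a day.
print(flights$trip_duration %% 1440)    # 75 145 85
```

The third row shows a caveat of this simple feature: a flight that departs at 23:50 and lands at 01:15 comes out negative unless the difference is wrapped around 1440 minutes. Whether to apply that wrap is a modeling choice and is not part of the original example.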