June 10th, 2015

Scaling R with H2O


With the advent of H2O 3.0, it seems like a good time to reintroduce the R API for H2O and help users better understand the differences between R data frames and H2OFrames. Some of the first questions we typically get are:

  • Does H2O support all R packages and functions?
  • Is H2OFrame an extension of data.frame?
  • Are H2O's algorithms written on top of preexisting R packages such as glmnet?

Reading in Data

The S4 object H2OFrame is a tabular representation of data that has been imported and parsed into H2O’s Distributed Key-Value Store (DKV). The object holds information about where the H2O cluster is (conn), what the frame is called on the cluster (frame_id), and the first 10 rows, which are used when the frame is printed. When you use an H2OFrame object in a supported function, R simply acts as a glue language: the R code you write makes a call to the Java backend, where the expression is actually computed.
For example, the user can execute an h2o.importFile command to access a remote H2O cluster and specify the path to import data from. This command is sent over H2O’s REST API to the corresponding endpoint on the Java side. Once the import is completed, H2O returns a JSON response that is summarized into the components described earlier. The most important difference, then, is that an R data frame sits in memory in R, while an H2OFrame is just a reference to an object in the DKV.
[Figures: parse, steps 1 to 3]
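
The reference-versus-copy distinction can be seen directly from the R console. The following is a minimal sketch, assuming a local H2O instance can be started and using the prostate.csv sample file that ships with the h2o package (neither appears in the original walkthrough). Argument names vary between h2o package versions, so treat it as illustrative rather than definitive.

```r
library(h2o)
h2o.init(nthreads = -1)

# Assumed sample data: prostate.csv bundled with the h2o package
path <- system.file("extdata", "prostate.csv", package = "h2o")

# The parsed data lives in the H2O cluster; R only holds a handle to it
prostate.hex <- h2o.importFile(path = path, destination_frame = "prostate.hex")
class(prostate.hex)   # H2OFrame, not data.frame
dim(prostate.hex)     # answered by the cluster, not computed in R

# Copying rows into R's memory is an explicit, separate step
first_rows.df <- as.data.frame(prostate.hex[1:10, ])
class(first_rows.df)  # an ordinary in-memory data.frame
```

Only the last step moves data into the R session, which is why an H2OFrame can describe a data set far larger than R itself could hold.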

Summarizing New Frame

Similarly, once the user has a frame and wants to summarize it, they can execute the h2o.summary or summary function. The command makes a call to H2O to execute a MapReduce (MR) task, which returns a JSON response that is parsed into a table object in R. The input and the actual execution of H2O’s summary function differ from those of the base summary function, but the output is still the same kind of table object that the generic summary returns.
[Figures: summary, steps 1 and 2]

Supported R functions

For a list of the functions you can apply to an H2OFrame, bring up the package documentation by executing ?h2o in the R console.
In short, H2O functions are prefixed with h2o so that users understand that, although H2O’s syntax resembles base R’s, they are different functions.
All of H2O’s algorithms are executed in memory as Java tasks on the H2O cluster, so no modeling work is done in R. When you call glmnet, glm, or h2o.glm, you are accessing three different implementations of GLM. There are, however, functions that have been overloaded as methods for H2OFrames, such as h2o.summary, which is also accessible as summary. Unary and binary operators on the frame do not carry the h2o prefix either; you can see the list of supported operators by executing ?"H2OFrame-class" in the R console (the name is quoted because of the hyphen).
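
To make the "different implementations" point concrete, here is a hedged sketch that fits the same model with base R's glm and with h2o.glm. It assumes a local H2O instance and the prostate.csv sample file bundled with the h2o package; the choice of predictors is purely illustrative and is not taken from this post.

```r
library(h2o)
h2o.init()

# Assumed sample data: prostate.csv bundled with the h2o package
path <- system.file("extdata", "prostate.csv", package = "h2o")

# Base R: data and model fitting both live in the R session
prostate.df <- read.csv(path)
r_fit <- glm(CAPSULE ~ AGE + RACE + PSA + DCAPS,
             data = prostate.df, family = binomial)

# H2O: R sends the request; the fit runs as a Java task on the cluster
prostate.hex <- h2o.importFile(path = path)
prostate.hex$CAPSULE <- as.factor(prostate.hex$CAPSULE)
h2o_fit <- h2o.glm(x = c("AGE", "RACE", "PSA", "DCAPS"), y = "CAPSULE",
                   training_frame = prostate.hex, family = "binomial")

class(r_fit)     # a base R "glm" object
class(h2o_fit)   # an H2O model object that points at the cluster-side model
```

The two fits need not produce identical coefficients, since the implementations and their defaults (for example, regularization) differ.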
The figure below shows some examples of parity between base R functions and H2O-specific R functions. The package was written so that the most frequently used R operations and expressions have an equivalent that is sent to the Java side to be computed.
[Figure: parity between base R functions and H2O-specific R functions]
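
As a rough illustration of that parity, the sketch below runs the same expressions on a tiny data.frame and on its H2OFrame counterpart. It assumes a local H2O instance; the toy data and variable names are made up for this example.

```r
library(h2o)
h2o.init()

# A small in-memory data.frame and its H2O counterpart (as.h2o uploads it)
df <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
hf <- as.h2o(df)

# Same syntax, different execution: base R computes locally,
# while the H2OFrame versions are evaluated by the H2O cluster
nrow(df); nrow(hf)
mean(df$x); mean(hf$x)
local_sum  <- df$x + df$y
remote_sum <- hf$x + hf$y   # the result is another H2OFrame on the cluster

# Bring the remote result back into R to compare the two
all.equal(local_sum, as.data.frame(remote_sum)[[1]])
```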

Examples of Transformation of Data Frame

For an example of how you might perform some simple transformations on your H2OFrame, download the small airlines data set and run through the following example.
To start, import the data and run the summary function.

## Load library and initialize h2o
library(h2o)
conn <- h2o.init(nthreads = -1)
pathToAirlines <- normalizePath("~/Downloads/allyears2k.csv")
airlines.hex <- h2o.importFile(conn = conn, path = pathToAirlines, destination_frame = "airlines.hex")
## Summary stats, histogram plots
summary(airlines.hex)

Next we want to create a feature indicating how long each trip took. DepTime and ArrTime were parsed as numerics in HHMM form, so they cannot simply be subtracted from one another. For each of them we extract the hour and minute and convert the value to total minutes elapsed since 12:00 AM. Finally, we take the difference between arrival and departure time and append it to the airlines frame. The beauty of the h2o package is how easy it is to translate R code into H2O+R code: the feature-creation commands below are exactly what you would run on an R data.frame that was read in using read.csv.

## Create trip_duration feature
hour1 <- airlines.hex$ArrTime %/% 100
mins1 <- airlines.hex$ArrTime %% 100
arrTime <- hour1*60+mins1
hour2 <- airlines.hex$DepTime %/% 100
mins2 <- airlines.hex$DepTime %% 100
depTime <- hour2*60+mins2
## Take the difference (arrival minus departure) and assign it to a new feature in frame
airlines.hex$trip_duration <- arrTime - depTime
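
Because the same operators work on a plain data.frame, the arithmetic can be checked in base R without a cluster. This is a small sketch with made-up times; the to_minutes helper is a hypothetical convenience function, not part of the h2o package, and the duration is computed as arrival minus departure so that it comes out positive.

```r
# Toy data.frame standing in for the airlines data (values are made up)
flights <- data.frame(DepTime = c(1215, 930, 5),
                      ArrTime = c(1430, 1045, 110))

# HHMM -> minutes since midnight, using the same %/% and %% operators
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + (hhmm %% 100)

flights$trip_duration <- to_minutes(flights$ArrTime) - to_minutes(flights$DepTime)
flights$trip_duration   # 135 75 65
```

For instance, 1215 becomes 12 * 60 + 15 = 735 minutes and 1430 becomes 870 minutes, giving a 135-minute trip. Flights that arrive after midnight would yield a negative value here; this sketch does not handle that case.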
