February 25th, 2020
AI & ML Platforms: My Fresh Look at H2O.ai Technology
By: Ellen Friedman, PhD
2020: A new year, a new decade, and with that, I’m taking a new and deeper look at the technology H2O.ai offers for building AI and machine learning systems.
I’ve been interested in H2O.ai since its early days as a company in 2014 (it was 0xdata back then). My involvement had been only peripheral, but now I’ve begun to work with this company. Just as nieces, nephews or the children of friends that you don’t see every day may grow up fast without you noticing, the H2O.ai technology has become a very impressive force in AI and machine learning over that time period. I’ve been pleased to see that it is now more comprehensive, even faster and much more widespread than I had realized.
The main impression I have about H2O.ai technology today is that it has evolved to be very customer-centric, which is a good thing. Many of the changes it has undergone make it more practical across a wide range of business settings, with options that work well for beginning or less experienced teams as well as very powerful and sophisticated approaches that advanced data scientists look for. Both groups will also be pleasantly surprised at advances that make feature engineering, model selection, management and deployment into production more efficient and less time consuming than expected. Those traits are key to making AI and machine learning practical in the real world. They expand accessibility of AI and machine learning to a wider group of users.
Popularity isn’t the best or only reason to choose an AI platform, but it does suggest that there’s something very attractive in the technology and a community around it. So it’s worth noting that just one aspect of H2O.ai technology, the H2O open source machine learning and AI platform called H2O-3, is used by over 20,000 organizations, including half of the Fortune 500 companies. The key question is, Why?
In this brief article, we’ll take a look at what the H2O.ai platforms of today are, what they can do and, most importantly, what you might want to do with them.
What is H2O?
H2O is both a technology and a robust community of users and developers worldwide. Let’s start with the technology. What exactly is included?
H2O.ai has both free and open source offerings as well as a choice of enterprise-grade products and services (some very new) that can be licensed. These are built by an expert worldwide team of engineers and data scientists with the company H2O.ai, headquartered in Mountain View, CA.
This figure and the following discussion give you an overview of what H2O.ai offers:
An overview of key technology and services offered by H2O.ai.
Let’s start with H2O-3. Most simply, H2O-3 is highly scalable, distributed, in-memory, very fast open source software technology for building AI and machine learning (ML) systems. It runs on small data sets and scales to large data sets from a wide variety of data sources. It can run on-premises or in the cloud and provides production-ready artifacts for deployment (POJO/MOJO as described below). The H2O-3 offerings provide a framework that makes it easier and more practical for you to build machine learning and AI systems quickly. You retain control over what is happening, but you don’t have to keep track of everything yourself. Because of its ability to handle huge amounts of data easily, H2O-3 is widely used across a variety of use cases in many different sectors. In the following section of this post, you’ll find out more about what H2O-3 does.
H2O-3 also includes automatic machine learning capabilities, referred to as H2O AutoML, described in the next section of this post.
User experience is also addressed by Sparkling Water, open source software developed by H2O.ai that enables the general purpose in-memory data processing system Apache Spark to interoperate seamlessly with the in-memory technology for scalable machine learning in H2O-3. Sparkling Water does not use Apache Spark to do machine learning. Instead, Sparkling Water provides integration between Spark and H2O-3. A typical scenario is that a data scientist will use Spark to do data prep and then use Sparkling Water to move from a Spark DataFrame to an H2O-3 Frame. Once you have an H2O-3 Frame, you continue to train and tune the model(s) using H2O-3. Think of Sparkling Water as a bridge from a Spark DataFrame to an H2O Frame.
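To make the bridge idea concrete, here is a minimal PySparkling sketch. It assumes Spark and the Sparkling Water Python package are installed and that you already have a Spark DataFrame in hand. The names `prep_columns` and `spark_to_h2o` are mine, for illustration only, and the conversion method has been renamed between Sparkling Water releases (older versions used `as_h2o_frame`), so check the documentation for the version you run.

```python
def prep_columns(columns, drop=()):
    """Hypothetical helper: choose which columns survive data prep."""
    dropped = set(drop)
    return [c for c in columns if c not in dropped]


def spark_to_h2o(spark_df, drop=()):
    """Do light prep in Spark, then hand the result to H2O-3 as an H2O Frame."""
    # Imported inside the function so this file still loads
    # on machines where Sparkling Water is not installed.
    from pysparkling import H2OContext

    # Data prep stays in Spark: keep the wanted columns, drop rows with nulls.
    keep = prep_columns(spark_df.columns, drop)
    prepared = spark_df.select(*keep).dropna()

    # Start (or attach to) H2O-3 alongside the current Spark session.
    hc = H2OContext.getOrCreate()

    # Cross the bridge: Spark DataFrame -> H2O-3 Frame.
    return hc.asH2OFrame(prepared)
```

From the returned frame onward, training and tuning happen with ordinary H2O-3 calls rather than with Spark.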
Many users of H2O-3 or Sparkling Water find it useful to purchase Enterprise Support, particularly as they prepare and deploy their machine learning and AI systems in production. For users wanting to access data stored in HDFS (the Hadoop Distributed File System) for their machine learning and AI projects, H2O.ai offers Enterprise Steam. Steam not only simplifies H2O.ai adoption, it is also particularly useful for managing H2O-3 securely in a multi-tenant architecture.
That brings us to an enterprise offering that was new to me and that really shows how the technology offered by H2O.ai has evolved: H2O Driverless AI. This platform was built from scratch to address several key needs that will be readily recognized by experienced data scientists, including:
- Speeding up effective feature engineering through automation
- Making model tuning and selection more efficient and faster
- Providing a practical interface for interpretability of model behavior
- Simplifying deployment: train once and run anywhere
- Providing an open and extensible automatic machine learning platform
Most appealing and perhaps surprising is the fact that H2O Driverless AI is easily customizable even though it automates key tasks to speed up the process of developing a machine learning system. The data scientist retains control of what happens, but his or her output is increased by automating some otherwise laborious tasks. This customizable automation frees up time and attention for the data scientist to focus on experimentation, performance and even new ways to use machine learning. In the following section of this post, you can read more about how Driverless AI does this.
By the way, you can use open source H2O-3 or H2O Driverless AI in any public or private cloud. For example, you can drag-and-drop data for your models from sources such as Google BigQuery, Amazon S3 or Azure Blob Storage. Here is a list of connectors that make this possible. Enterprise Puddle lets you launch H2O.ai products in a virtual private cloud (VPC), currently supporting Azure and Amazon Web Services (AWS), with Google Cloud Platform (GCP) expected soon.
There’s also a new offering – new entirely, not just new to me – called H2O Q. H2O Q was mentioned at the H2O World conference in New York last fall. H2O Q is good news for business analysts, people new to data science and essentially all business users. Q is an AI platform that makes it practical for business users to gain insights “in the moment,” fitting the way people are used to thinking. You can hear from an existing customer about H2O Q in this short video. Or apply for early access to H2O Q yourself.
We have seen an overview of what H2O.ai technology actually is. I will talk more about H2O.ai’s newest offering, Q, in future posts, but here I give you a slightly more detailed look at what H2O-3 and H2O Driverless AI can do.
What Do H2O-3 and H2O Driverless AI Do?
Open source H2O (H2O-3) requires Java but also lets you use familiar programming languages including Python, R or Scala to build machine learning models for both supervised and unsupervised approaches. Alternatively, you can build models in H2O-3 without ever writing code, by using H2O Flow. The latter is an open source interactive user interface in a web-based environment.
H2O-3 provides a rich collection of algorithms that were developed from the ground up to provide excellent performance on large scale distributed systems. This includes supervised techniques such as regression and classification and unsupervised learning in the form of clustering. H2O-3 algorithms include Random Forest, GLM, GBM, XGBoost and more.
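For readers who want to see what this looks like from Python, here is a minimal sketch of training one of those algorithms (GBM) with the `h2o` package. It assumes Java and `h2o` are installed, that the file is tabular, and that the response column is binary so that AUC applies. The function names, the 80/20 split and the default `ntrees` value are my own illustrative choices.

```python
def predictor_columns(columns, response):
    """Every column except the response is treated as a predictor."""
    return [c for c in columns if c != response]


def train_gbm(csv_path, response, ntrees=50, seed=1):
    """Train an H2O-3 GBM classifier; return the model and its validation AUC."""
    # Imported here so the file loads even where h2o is not installed.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()  # start or connect to a local H2O-3 cluster (needs Java)

    frame = h2o.import_file(csv_path)
    # Mark the response as categorical so H2O-3 does classification.
    frame[response] = frame[response].asfactor()

    # Hold out 20% of the rows for validation.
    train, valid = frame.split_frame(ratios=[0.8], seed=seed)

    model = H2OGradientBoostingEstimator(ntrees=ntrees, seed=seed)
    model.train(
        x=predictor_columns(frame.columns, response),
        y=response,
        training_frame=train,
        validation_frame=valid,
    )
    return model, model.auc(valid=True)
```

Swapping in another estimator class from `h2o.estimators` follows the same train-and-evaluate pattern.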
What about data sources? H2O-3 expects tabular data, which can be read from a variety of popular data sources including Amazon S3, Azure Data Lake, HDFS (Cloudera/Hortonworks) or MapR. The following diagram provides some context for the H2O-3 platform features:
This open source platform has AutoML capabilities (H2O-3 AutoML) that run through algorithms and their hyperparameters automatically to produce a leaderboard of best models. Learn more about the H2O-3 AutoML interface in the documentation. You may also want to see this blog post, “A Deep Dive into H2O’s AutoML”.
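As a rough idea of how little code the leaderboard takes, here is a minimal sketch using the `H2OAutoML` class from the `h2o` Python package. It assumes Java and `h2o` are installed and that the response is categorical. The function name `run_automl` and the cap of 10 models are illustrative choices of mine, not recommendations.

```python
def run_automl(csv_path, response, max_models=10, seed=1):
    """Run H2O-3 AutoML on a tabular file; return (leaderboard, best model)."""
    # Imported here so the file loads even where h2o is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O-3 cluster (needs Java)

    train = h2o.import_file(csv_path)
    # Treat the response as categorical so AutoML builds classifiers.
    train[response] = train[response].asfactor()

    # Cap the search by number of models; a time budget is another option.
    aml = H2OAutoML(max_models=max_models, seed=seed)

    # With x omitted, every column except the response is used as a predictor.
    aml.train(y=response, training_frame=train)

    # The leaderboard ranks the trained models; the leader is the top entry.
    return aml.leaderboard, aml.leader
```

The returned leaderboard is itself an H2O Frame, so it can be inspected like any other frame, and the leader can be used for prediction like any other H2O-3 model.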
How do you productionize your machine learning models?
Machine learning and AI only bring value once they are in production, so an important characteristic of H2O-3 is ease of productionizing machine learning models by deploying with POJOs (Plain Old Java Objects) or binary format MOJOs (Model Object, Optimized). You’ll find out more about productionizing H2O here.
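Exporting a MOJO from Python is a single call on a trained model. The sketch below assumes `model` is an H2O-3 model you have already trained, and the wrapper name `export_mojo` is mine.

```python
def export_mojo(model, out_dir="."):
    """Save a trained H2O-3 model as a MOJO and return the path of the file.

    get_genmodel_jar=True also downloads h2o-genmodel.jar, the Java
    library that scores MOJOs without a running H2O-3 cluster.
    """
    return model.download_mojo(path=out_dir, get_genmodel_jar=True)
```

The resulting file is what you would hand to the Java application or service that does the scoring in production.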
A great resource to learn in detail about the open source H2O platform is the overview and menu from H2O documentation. Or get the book titled Practical Machine Learning with H2O, a 300 page deep-dive by Darren Cook (published December 2016). The book is available from O’Reilly Media or Amazon. You also can try a free tutorial on machine learning with H2O provided by H2O.ai.
That brings us to the enterprise H2O Driverless AI platform and what it does. Driverless AI goes even further in improving efficiency and output of a data science team. Here’s how:
H2O Driverless AI addresses a class of problems in supervised machine learning, primarily using classification and regression techniques. At present, it is not used for unsupervised machine learning, but a vast majority of use cases employ supervised learning, so this makes Driverless AI widely useful. As to improving efficiency, H2O Driverless AI not only makes development time much shorter in general, it also improves accuracy of the learning systems being developed. AI and machine learning require a lot of experimentation and trial-and-error to build effective systems, so another way to describe efficiency is that Driverless AI reduces the time needed to get to a desired level of accuracy.
This innovative enterprise automatic machine learning platform employs state-of-the-art automatic feature engineering to detect potentially useful features in a given dataset and test how they interact, as well as testing their relative importance for model creation and performance. The automatic feature engineering capabilities of H2O Driverless AI also help you derive new features. Once the most appropriate features are selected, they are transformed in a way that they can easily be used by machine learning algorithms of interest. These techniques are called transformers.
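To give a feel for the general idea, here is a toy, pure-Python sketch of automatic feature engineering: derive candidate features from pairs of columns, then rank originals and candidates by how strongly each relates to the target. This is not how Driverless AI works internally. The product-of-two-columns transformer, the correlation score and all of the names are simplifying assumptions of mine.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def interaction_features(columns):
    """Derive one new feature per pair of columns: their element-wise product."""
    return {
        f"{a}*{b}": [x * y for x, y in zip(columns[a], columns[b])]
        for a, b in combinations(columns, 2)
    }


def rank_features(columns, target):
    """Rank original and derived features by |correlation| with the target."""
    candidates = {**columns, **interaction_features(columns)}
    scored = [(name, abs(pearson(vals, target))) for name, vals in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data in which the target really is width times height.
columns = {"width": [1, 2, 3, 4, 2, 5], "height": [4, 1, 3, 2, 5, 1]}
target = [w * h for w, h in zip(columns["width"], columns["height"])]
ranking = rank_features(columns, target)
```

Because the toy target was built as width times height, the derived `width*height` feature ranks first, ahead of either original column. A real system would test many more kinds of transformers and would score them with validated model performance rather than a single correlation.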
Find out more about automatic feature engineering with H2O Driverless AI by watching this video of an H2O World London presentation by Dmitry Larko, Senior Data Scientist at H2O.ai.
Feature engineering is just one group of essential tasks that normally are iterative, require special skills and expertise, and typically are very time consuming in the development of machine learning and AI systems. H2O Driverless AI changes all that by also automating key tasks such as model validation, model tuning, model selection and deployment. Thus H2O Driverless AI reduces the AI development timeline from months to minutes or hours. This capability of Driverless AI was the aspect of new H2O.ai offerings that surprised me the most. I heard from experienced data scientists among the H2O.ai customer base who had been skeptical at first. They worried that automation might not be able to meet their requirements, but once they put their hands on H2O Driverless AI, they reported being immediately impressed by how much faster their work could be.
Keep in mind that automation of these tasks does not mean loss of control by the data scientist. Driverless AI delivers an excellent blend of automation and customization. Data scientists determine how they will develop their machine learning systems in order to take best advantage of their expertise. The H2O Driverless AI platform can also be extended by uploading custom recipes with their own models, transformers and scorers. Use this Bring Your Own Recipe option or choose from a collection of open source recipes provided by the large H2O.ai community and curated by experts at H2O.ai. You’ll find a wealth of information about how data scientists are using these recipes plus a video on the topic by Arno Candel on this Recipes for Driverless AI resource.
There’s much more to know about what H2O Driverless AI can do, but one of the strengths that you certainly will want to explore is its capability to provide explainable AI.
Screenshot from a short video about how to get explainable AI: a selection of tools in the H2O Driverless AI platform that show you what is happening in AI-based decisions. Watch the 7.5 minute video for more.
You can also delve deeper into the topic of explainable AI through this excellent free book An Introduction to Interpretable Machine Learning, Second Edition by Patrick Hall and Navdeep Gill (© 2019 O’Reilly Media).
What Can You Use H2O Technology to Do?
What does successful AI and machine learning look like? That depends on what you want (and need) to do in your particular business or research setting. The offerings from H2O.ai address a wide variety of problems across a range of industries including, but not limited to, financial services, marketing, healthcare, telecom, insurance and retail. You may find it useful to browse the overview by industry of what you can do with H2O technology. Or you may find it helpful to explore customer use cases such as:
- Fraud mitigation and customer retention (Insurance)
- Improving clinical workflow and predicting hospital acquired infections (Healthcare)
- Anomaly detection and know-your-customer/client (Financial Services)
- Understanding customer churn (Telco)
For a complete collection, go to this customer use case resource.
H2O.ai has a vibrant community who meet regularly in a variety of venues worldwide. There are over 150,000 members of the H2O.ai meetup group. Check out the announcements for events in your area or follow @h2oai on Twitter to find out more about meetup events and free webinars.
In addition to meetups, there are larger H2O World conferences that bring together business customers, newcomers and data scientists including Kaggle grandmasters. I was fortunate to attend H2O World New York in October 2019. One of the things that struck me as unusual was the wide range of experience and sectors of the attendees, from those who are just beginning to use machine learning and AI to highly experienced data scientists who are used to bringing AI to complex, sophisticated problems. Surprisingly, H2O.ai technology often can be an appropriate choice in these widely different situations, depending on which capabilities are put to work.
Charles Elkan of Goldman Sachs on stage at H2O World New York in October 2019.
The description in this blog posting of what is offered by H2O.ai is really just the tip of the iceberg (frozen water?) but it gives you a starting point to see what is possible.
The best next steps are for you to put your hands on H2O.ai tools and see what you think. You’ll find a collection of free AI Tutorials for H2O-3, Sparkling Water and H2O Driverless AI, provided by H2O.ai. And you can sign up to take H2O Driverless AI out for a free 21-day spin. Or be an explorer by applying for early access to the newest offering, the H2O Q platform.
Whatever you try next, good luck on your AI and machine learning journey!