March 6th, 2019

Machine Learning with H2O – the Benefits of VMware

Category: Cloud, Community, Driverless AI

This blog was originally posted by Justin Murray of VMware and can be accessed here.

 

This brief article introduces a short 4.5-minute video that explains why VMware vSphere is a great base operating platform for data scientists and data engineers. The video then demonstrates this with an example: a data scientist runs a modeling experiment on an input data set, using the Driverless AI tool from H2O.ai for the data analysis and model training, all in VMs. The key idea is that the world of machine learning and data science is changing rapidly, with new, powerful tools, platforms and versions appearing and being upgraded at a very fast pace. The tool vendors are racing to innovate and are producing new workbenches for both the expert and the novice in the field.

Data scientists and data engineers (who organize and cleanse the data first) want to be able to try out these new tools, and updated versions of them, while keeping a stable environment for their existing production deployments. The end goal is to produce a highly accurate trained ML model as quickly as possible for any input data set and predicted outcome. One measure of model quality you will see at the end of the tool demo is the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate as the decision threshold varies; other measures of model accuracy are also available. The trained ML model will subsequently be used in production applications for the inference phase. The example model chosen here is XGBoost, a popular gradient-boosted decision tree algorithm for certain kinds of data. The ML inference phase (the production deployment of the model in a pipeline) is often concerned with classification, such as flagging a fraudulent transaction, or with prediction of what may happen in the future, such as the likelihood that someone will not pay their credit card bill or will choose a related book or movie. The ML practitioners (data scientists and data engineers) want to use the best tools and platforms they can get their hands on in order to build the best trained model for recognizing these patterns.
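To make the ROC curve mentioned above more concrete, here is a minimal sketch in plain Python of how the curve's points and the area under it (AUC) can be computed from a model's scores. The labels and scores are made-up toy values, and the function names `roc_points` and `auc` are hypothetical; this is not how Driverless AI computes or displays the curve, only an illustration of the idea.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The decision threshold is swept from the highest score to the lowest;
    examples with tied scores are handled together as one step.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example whose score equals the current threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (assumed for illustration): 1 = positive class, 0 = negative class.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

On this toy data the area comes out to 7/9 (about 0.78): of the nine positive/negative pairs, the model ranks the positive example higher in seven. An AUC of 1.0 would mean a perfect ranking, and 0.5 is what random scores would give on average.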

This rapid change of tooling places a significant demand on an IT department, which has to keep up with the innovation and satisfy its customers, the data scientists and data engineers, by giving them what they want while maintaining some control. Deploying on VMware vSphere lets IT create separate sandboxes for the data scientists to work in, each contained in one or more virtual machines. This provides isolation, checkpointing and the ability for the data scientist to innovate in a safe environment.

While many well-known examples of machine learning focus on image recognition and classification, the H2O tool in our demo is used to analyze tabular data, which in this example is contained in a CSV file. This data is representative of the many thousands of datasets found in enterprises, in database tables, spreadsheets and regular human-readable files. A term used frequently here is “independent and identically distributed” (IID) data, meaning that each row is treated as an independent sample drawn from the same underlying distribution. This kind of data is structured into rows and columns, which is very different from the layout of pixels in an image, so the ML models that best analyze IID/tabular data may well differ from the models that deal with images. Financial institutions, insurance companies, retail operations and dozens of other kinds of enterprise hold a lot of this “tabular” data, so there is a big opportunity to apply machine learning to this type of dataset and improve these enterprises’ understanding of their customers. Such data may also be somewhat sensitive, and so will likely be modeled in-house for the foreseeable future.
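As a small illustration of what “rows and columns” means in practice, the sketch below reads a tiny CSV with Python's standard `csv` module and separates the input columns from the column to be predicted. The column names (`balance`, `late_payments`, `defaulted`) and values are invented for this example and are not the dataset used in the video; `load_table` is a hypothetical helper.

```python
import csv
import io

# Hypothetical toy data; the column names and values are invented for illustration.
CSV_TEXT = """balance,late_payments,defaulted
2500.00,0,0
870.50,3,1
4300.25,1,0
"""


def load_table(text, target):
    """Split CSV text into per-row feature dicts and a list of target values."""
    reader = csv.DictReader(io.StringIO(text))
    features, targets = [], []
    for row in reader:
        # The target column is what the model should learn to predict.
        targets.append(int(row.pop(target)))
        # Every remaining column is treated as a numeric input feature.
        features.append({name: float(value) for name, value in row.items()})
    return features, targets


features, targets = load_table(CSV_TEXT, "defaulted")
print(features[0])
print(targets)
```

Each row is one example (here, one customer) and each column is one attribute, which is the shape of data that tabular modeling tools expect as input.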

 

About the Author

Vinod Iyengar, VP of Products

Vinod is VP of Products at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of Marketing & Data Science experience in multiple startups. He was the founding employee for his previous startup, Activehours (Earnin), where he helped build the product and bootstrap the user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He’s built models to score leads, reduce churn, increase conversion, prevent fraud and many more use cases. He brings a strong analytical side and a metrics driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.
