March 28th, 2019

Building AI/ML models on Lending Club Data, with H2O.ai — Part 1


Lending Club publishes a basic version of its loan database to the public and a full version to its customers, anonymized of course. You can find the download page from this link.

The publicly downloadable loan data has more than 150 columns of categorical, numeric, text and date fields. It also has a ‘loan_status’ text column that indicates whether a loan was Fully Paid or Charged Off. That makes the data a good fit for a binary classification problem in machine learning.

In this blog post series, we are going to explore how to do Automatic Machine Learning with the H2O.ai ML product suite. H2O.ai has two Auto ML solutions:

H2O-3, the Open Source version, has been around for several years, is used by thousands of users, and is a scale-out enterprise product. It has basic Auto ML (Automatic Machine Learning) support.

Driverless AI, by contrast, was announced by H2O.ai just last year for commercial use. Today the product runs on a single server instance, optionally with GPUs. Besides Automatic Machine Learning, it has a rich set of features such as Automatic Feature Engineering (with more than 30 feature transformers, including NLP), Auto-Viz, Auto-Doc and Machine Learning Explainability.

The goal of this blog post series is to show you how to use Automatic Machine Learning and other features from a Jupyter notebook interface. We will use H2O.ai’s Python client libraries to connect to both H2O-3 Open Source and Driverless AI, build AI/ML models, and take full advantage of the capabilities they provide.

Data Prep

The Python 3 code for this step runs in a Jupyter notebook and creates two CSV files, train_lc.csv and test_lc.csv, from the Lending Club data.

As part of data cleansing and preparation, we drop some target leakage columns to make sure we get a model that is worthy of production use.
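As a rough illustration of the steps just described, here is a minimal pandas sketch. It is not the notebook's code. The source file name, the list of leakage columns, the 80/20 split and the helper names `prepare_loans` and `build_csvs` are all assumptions made for this example; only the `loan_status` column and the two output file names come from the post.

```python
import pandas as pd

# Assumption: an illustrative subset of columns that are only known after a
# loan has been serviced. The actual notebook may drop a different set.
LEAKAGE_COLS = [
    "total_pymnt", "total_rec_prncp", "total_rec_int", "total_rec_late_fee",
    "recoveries", "collection_recovery_fee",
    "last_pymnt_d", "last_pymnt_amnt", "out_prncp",
]


def prepare_loans(df, leakage_cols=LEAKAGE_COLS, test_frac=0.2, seed=42):
    """Keep finished loans, drop leakage columns, return (train, test)."""
    # Keep only the two outcomes that define the binary target.
    status = df["loan_status"].astype(str).str.strip().str.lower()
    finished = df.loc[status.isin(["fully paid", "charged off"])]
    finished = finished.reset_index(drop=True)

    # Drop whichever leakage columns are actually present in this file.
    present = [c for c in leakage_cols if c in finished.columns]
    finished = finished.drop(columns=present)

    # Random hold-out split; the seed makes the split reproducible.
    test = finished.sample(frac=test_frac, random_state=seed)
    train = finished.drop(index=test.index)
    return train, test


def build_csvs(source_csv):
    """Read a downloaded loan file and write the two CSVs named in the post."""
    # Assumption: source_csv is a plain CSV with a single header row.
    loans = pd.read_csv(source_csv, low_memory=False)
    train, test = prepare_loans(loans)
    train.to_csv("train_lc.csv", index=False)
    test.to_csv("test_lc.csv", index=False)
```

Calling `build_csvs("loans.csv")` (a hypothetical file name) in a notebook cell would write both files to the working directory. Filtering on `loan_status` first means loans that are still in progress never reach the model, and dropping post-origination columns keeps the model from learning from information that would not exist at decision time.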

The full notebook is available here: https://git.io/fjTqb

In the next post in this series, we will explore how to kick off Automatic Machine Learning with both H2O-3 Open Source and Driverless AI on the training data set, score the test data set, and compare the features and results across both products. You can read the second part here.

 

About the Authors

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition.

Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. Over the years he has published 50+ blogs on “all things data science” on the LinkedIn, Forbes and Medium publishing platforms for a business audience, and he speaks at vendor data science conferences. He also holds multiple patents around Desktop Virtualization and Ad networks, and was a co-founding member of two startups in Silicon Valley.

Vinod Iyengar, VP of Products

Vinod is VP of Products at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of Marketing & Data Science experience in multiple startups. He was the founding employee at his previous startup, Activehours (Earnin), where he helped build the product and bootstrap user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He has built models to score leads, reduce churn, increase conversion, prevent fraud and cover many more use cases. He brings a strong analytical side and a metrics-driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.
