April 15th, 2019

Building AI/ML models on Lending Club Data, with H2O.ai — Part 2

Category: AutoML, Data Journalism, Data Science, Driverless AI

In Part 1 of this series, we looked at how to download data from Lending Club using Jupyter/Python and create training and test data sets after dropping some target-leakage columns. The data preparation code that creates the data sets for classification is available on GitHub at: https://git.io/fjTqb

In this blog post, we are going to use H2O-3 AutoML to build a model on the training set and score on the test data set, using the Python client library “h2o”. We will extract features that are important in predicting whether a loan will be “Fully Paid” or “Charged Off”.

First, download the latest Python client library (along with a single instance of H2O-3) from this page:

http://h2o-release.s3.amazonaws.com/h2o/rel-yates/1/index.html

Click “Install in Python” and follow instructions.

H2O-3 AutoML can run multiple algorithms, tune hyperparameters, perform cross-validation, build Stacked Ensembles from the best models, and create a self-contained scoring package that can be deployed in production.

Algorithms tried by H2O-3 AutoML (as of version 3.24.0.1):

  • DRF — Distributed Random Forest
  • GLM — Generalized Linear Model
  • XGBoost — XGBoost GBM
  • GBM — H2O GBM
  • Deep Learning
  • Stacked Ensemble of above

AutoML can be kicked off in open source H2O-3 from the R or Python interface, or from H2O Flow, a browser UI for interactive model building.

For the Lending Club use case, the Python notebook shows how to connect to an H2O-3 cluster in the cloud or to a local instance, upload the training/test data, and kick off AutoML with some basic parameters. It also shows how to view the composition of the AutoML leader (usually a Stacked Ensemble), compute variable importance for multiple algorithms in the AutoML leaderboard, and analyze the results. Finally, there is code to predict the outcome, loan_status, for the test data set and analyze model performance on the test set.

The Jupyter Python notebook in this blog post is available from GitHub: https://git.io/fjke6

Run Automatic Machine Learning in a few steps:
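The steps can be sketched roughly as follows. This is a minimal sketch and not the notebook's exact code: the function names, file paths, `max_runtime_secs` value and seed are placeholder assumptions, while `h2o.init`, `h2o.import_file`, `H2OAutoML` and `model_performance` come from the h2o Python client.

```python
def predictor_columns(columns, target):
    """Return every column except the target, preserving order."""
    return [c for c in columns if c != target]


def run_automl(train_path, test_path, target="loan_status", max_runtime_secs=600):
    """Train AutoML and return (aml, test AUC). Paths and limits are placeholders."""
    # Imported here so the helper above can be used without an H2O install.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # attaches to a local H2O-3 instance, starting one if needed

    train = h2o.import_file(train_path)
    test = h2o.import_file(test_path)

    # Treat the target as categorical so AutoML runs binary classification.
    train[target] = train[target].asfactor()
    test[target] = test[target].asfactor()

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1)
    aml.train(x=predictor_columns(train.columns, target), y=target,
              training_frame=train)

    print(aml.leaderboard)                     # one row per model AutoML trained
    perf = aml.leader.model_performance(test)  # score the leader on the test set
    return aml, perf.auc()
```

To use a cluster in the cloud instead of a local instance, `h2o.connect` can be pointed at the cluster's address in place of `h2o.init()`.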

AutoML Performance:

Notice in the leaderboard how H2O-3 AutoML ran multiple algorithms such as XGBoost, GLM, Deep Learning and GBM. Also, the top 2 models with the highest AUC are Stacked Ensembles built on the rest of the models in the leaderboard.

How to Gain Insights into the model?

The standardized coefficient magnitudes of the GLM model in the leaderboard give us a sense of what differs between a loan getting paid in full and a loan getting charged off/defaulted. The features in blue push toward the loan being paid in full, while those in orange push toward the loan defaulting; the length of each bar indicates its importance. In summary:

Top 7 factors correlated with a loan getting Fully Paid, in order of importance (looking only at the blue bars):

  • 36_months – If the loan term is shorter, like 3 years
  • A – If the loan grade is “A”
  • total_bc_limit – If the total bank card credit limit is high
  • mo_sin_old_rev_tl_op – If many months have passed since the oldest revolving account was opened
  • MORTGAGE – Whether the customer had a home mortgage loan open
  • total_il_high_credit_limit – Total installment high credit/credit limit (kind of % payments to total credit limit)
  • earliest_cr_line – When the first credit line was opened

Top 7 factors correlated with a loan getting Charged Off, in order of importance (looking only at the orange bars):

  • int_rate – Interest rate on the loan was high
  • 60_months – If the loan term is longer, like 5 years
  • <ABC> –
  • acc_open_past_24mths – High number of accounts opened in the past 24 months
  • dti – Debt-to-income ratio is high
  • issue_d – Month/year in which the loan was issued
  • RENT – Whether the customer was a renter instead of a home owner
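The blue/orange reading of the chart amounts to splitting the standardized coefficients by sign and ranking each group by magnitude. A minimal sketch, assuming the coefficients are available as a name-to-value dict (the h2o client's GLM `coef_norm()` returns one) and that “Fully Paid” is the positive class, as in the chart. `split_by_sign` is a hypothetical helper and the numbers are invented toy values, not the model's actual coefficients.

```python
def split_by_sign(std_coefs, top_n=7):
    """Split standardized coefficients into positive and negative groups,
    each ordered by descending magnitude and truncated to top_n."""
    items = [(name, c) for name, c in std_coefs.items() if name != "Intercept"]
    by_magnitude = sorted(items, key=lambda kv: abs(kv[1]), reverse=True)
    positive = [name for name, c in by_magnitude if c > 0][:top_n]
    negative = [name for name, c in by_magnitude if c < 0][:top_n]
    return positive, negative


# Toy values for illustration only; NOT the coefficients from the post's model.
toy = {
    "Intercept": 1.5,
    "int_rate": -0.40,
    "36_months": 0.35,
    "dti": -0.10,
    "total_bc_limit": 0.12,
}
paid, charged_off = split_by_sign(toy)
print("Toward Fully Paid: ", paid)         # ['36_months', 'total_bc_limit']
print("Toward Charged Off:", charged_off)  # ['int_rate', 'dti']
```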

The model discovers these income/credit/debt characteristics of customers automatically from the data. However, correlation should not be mistaken for causation (which is outside the scope of this blog).

Besides the GLM model, you can also walk through each model in the leaderboard and look at its variable importance. See the code in the original notebook: https://git.io/fjke6
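A rough sketch of such a walk-through, not the notebook's code. It assumes `aml` is a trained `H2OAutoML` object and that Stacked Ensembles, whose model IDs start with `StackedEnsemble`, should be skipped because they do not report variable importance directly. Both function names are hypothetical.

```python
def models_with_varimp(model_ids):
    """Drop Stacked Ensembles, identified here by their model ID prefix."""
    return [m for m in model_ids if not m.startswith("StackedEnsemble")]


def print_variable_importance(aml, top_n=10):
    """Print the top_n variable importances for each non-ensemble model."""
    import h2o  # imported lazily so the helper above works without H2O installed

    # Leaderboard is an H2OFrame; convert the ID column to a plain Python list.
    ids = aml.leaderboard["model_id"].as_data_frame()["model_id"].tolist()
    for model_id in models_with_varimp(ids):
        model = h2o.get_model(model_id)
        vi = model.varimp(use_pandas=True)  # pandas DataFrame, or None if unavailable
        if vi is not None:
            print(model_id)
            print(vi.head(top_n))
```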

How to Learn More about H2O-3 AutoML?

To learn more about open source H2O-3 AutoML, see Erin LeDell’s YouTube video on the topic.

Summary of Results

The final AUC on the test set was 0.729.
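One way to read this number: AUC is the probability that the model scores a randomly chosen positive example above a randomly chosen negative one, so roughly 73% of such pairs are ranked correctly here. The pairwise definition can be sketched in plain Python as below. This is an illustration on toy data, not how H2O computes the metric internally.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, unrelated to the Lending Club results.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3]))  # 0.75
```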

The data was a snapshot in time in which loans were still running (some early stage and some late), not necessarily “cohorts”. In the data preparation phase, we also dropped a lot of columns that were giving away the outcome. The models built are still very useful for understanding the drivers behind the outcome. Using additional H2O-3 APIs, you can download scoring artifacts to productionize the model. So, how can we improve the accuracy and get full Machine Learning Interpretability of the final model?

Next Steps

H2O-3 AutoML can help you build models quickly and understand variable importance with very little effort. Recall that we didn’t do any feature engineering (such as one-hot encoding) on the input data! In the next blog posts, we will explore how to do the following, in addition to Automatic Machine Learning:

  • Automatic Feature Engineering
  • Machine Learning Interpretability

with H2O’s commercial product Driverless AI.

About the Authors

Vinod Iyengar

Vinod is VP of marketing and technical alliances at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of Marketing & Data Science experience in multiple startups. He was the founding employee for his previous startup, Activehours (Earnin), where he helped build the product and bootstrap the user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He’s built models to score leads, reduce churn, increase conversion, prevent fraud and many more use cases. He brings a strong analytical side and a metrics driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition.

Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. He has published 50+ blogs on “all things data science” on the LinkedIn, Forbes and Medium publishing platforms over the years for the business audience, and speaks at vendor data science conferences. He also holds multiple patents around desktop virtualization and ad networks, and was a co-founding member of two startups in Silicon Valley.
