October 16th, 2018

How This AI Tool Breathes New Life Into Data Science

Category: Beginners, Data Journalism, Data Science, Deep Learning, Driverless, Explainable AI, GPU, H2O Driverless AI, Machine Learning, NLP, Python, R, Technical

Ask any data scientist in your workplace: any supervised learning ML/AI project goes through many steps and iterations before it can be put into production, starting with the question “Are we solving a regression or a classification problem?”

  1. Data Collection & Curation
  2. Are there Outliers? What is the Distribution? What do we do with what we have?
  3. Feature Engineering & Class imbalance adjustments – Encoding/Munging, etc. How to handle text, numeric, categorical, binary data types?
  4. Careful Separation of Training, Validation and Test data sets
  5. Algorithm and Model Selection, Tuning and Cross-Validation
  6. Run it on a cloud or on-prem platform where algorithms (not just TensorFlow) run in parallel on GPUs (100x faster). How about using the latest Gradient Boosting and XGBoost algorithms, parallelized on GPUs?
  7. Are we avoiding Overfitting?
  8. Bonus points – Run multiple ensembles built from the best-performing algorithms and their parameters, then choose the winning ensemble.
  9. Build a scoring pipeline that can be used right away in production with just a few basic hooks.
  10. Rinse and repeat all of the above every time you get new data or when models decay.
  11. Explain to the business how every model is doing and what it is doing – ALL THE TIME.
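To make step 4 concrete, here is a minimal sketch of a stratified train/validation/test split using only the Python standard library. The function name `stratified_split`, the 60/20/20 fractions and the toy labels are assumptions for illustration; this is not how Driverless AI splits data internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, valid, test) lists of row indices.

    Rows are split class by class so each subset keeps roughly the
    same class balance as the full data set.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_valid = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        valid.extend(indices[n_train:n_train + n_valid])
        # Whatever is left over goes to the test set.
        test.extend(indices[n_train + n_valid:])
    return train, valid, test


# Toy, imbalanced labels: 80 negatives and 20 positives (made-up data).
labels = [0] * 80 + [1] * 20
train, valid, test = stratified_split(labels)
```

Because the split is done per class, the 20% positive rate of the toy data is preserved in all three subsets, and no row index appears in more than one subset.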

Every step above comes with a myriad of challenges, and today all of them are painful to do, even for the best data scientists on a team. If your data has 200 columns or features that are a mix of numeric, categorical and text data, the problem becomes exponentially harder. Even savvy data scientists avoid running exhaustive tests, procedures and scientific methods; they try to stick to the basics or automate only one or two of the steps above. The most automation you get today comes from a few cloud/SaaS vendors: they let you choose an algorithm to run on your base features and then do some repetitive hyper-parameter tuning to find the best model for production.

In reality, doing the above right takes a lot more than repetitive model tuning or a single algorithm. We are not even talking about a big data problem: even with just 1 million rows, you could spend days, weeks or even a month getting to an AUC of 0.94 or a really low log loss with no overfitting, only to find that your next batch of data has you chasing the problem all over again.
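For readers less familiar with the two scorers mentioned here, the following is a minimal sketch of how AUC and binary log loss can be computed from scratch. The function names and the four-row toy data are assumptions for illustration; in practice you would normally rely on a library implementation.

```python
import math


def roc_auc(y_true, y_score):
    """AUC as the probability that a randomly chosen positive is scored
    higher than a randomly chosen negative (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true binary labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Toy labels and predicted probabilities (made-up data).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_prob)
loss = log_loss(y_true, y_prob)
```

Higher is better for AUC and lower is better for log loss. Computing both on the training data and on held-out data, and comparing the two, is one simple way to spot overfitting.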

What if you had a tool that solves the above end-to-end problem using AI?

Driverless AI from H2O (AI to do AI) 

Load your ground-truth CSV file or point to your data source and push a button. Under the hood, the tool handles steps such as feature engineering, feature evolution and running multiple algorithms on GPUs, then outputs the code to put into production, along with an in-depth explainability report! How about getting to the leaderboard in a few minutes or an hour or so, using a tool with Kaggle Grandmaster smarts?

The screenshot above is from a run in which I built a classification model with the “Wine Data Set” from the UCI ML repository. Within 15 seconds of starting the classification, the first XGBoost model had already given me an AUC of 0.9267, with the variable importance shown on the screen. As Driverless AI evolves the features and runs multiple algorithms (on 8 GPUs in this instance), you can watch the AUC continuously improve as it tunes LightGBM, XGBoost, TensorFlow, GLM, etc., with hyper-parameter optimization and feature evolution.

The next time I ran it, I used LOGLOSS as my scorer, and it finished with 0.7876 based on my initial settings. The evolution of both models and features is shown in this graph below:

Driverless AI can run in Docker on your PC/Mac or on an EC2, Azure or GCP instance in the cloud (with multiple GPUs!). After you load your data and finish the experiment, you can deploy a Java or Python scoring pipeline package that contains the entire workflow. All you have to do is call the hooks from your production application with your new data stream to get results, whether it's a binary outcome, a numeric estimate/forecast or a multi-class decision.
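As a rough illustration of what "calling the hooks" might look like from application code, here is a minimal sketch. Everything in it is hypothetical: `placeholder_scorer` merely stands in for whatever scoring call the exported package provides, and the column names and threshold are made up. It does not use or reproduce the actual Driverless AI scoring API.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical input columns; a real scoring pipeline defines its own.
FEATURE_COLUMNS = ["alcohol", "malic_acid", "color_intensity"]


def placeholder_scorer(values: List[float]) -> float:
    """Stand-in for the exported pipeline's scoring call (NOT the real API).

    Returns 1.0 when the first feature exceeds an arbitrary threshold.
    """
    alcohol = values[0]
    return 1.0 if alcohol > 13.0 else 0.0


def score_stream(
    rows: Iterable[Dict[str, float]],
    scorer: Callable[[List[float]], float],
    columns: List[str],
) -> Iterator[float]:
    """Validate each incoming row, then pass its values to the scorer
    in the column order the pipeline expects."""
    for row in rows:
        missing = [c for c in columns if c not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
        yield scorer([float(row[c]) for c in columns])


# Two made-up rows arriving from a production data stream.
new_rows = [
    {"alcohol": 13.2, "malic_acid": 1.8, "color_intensity": 4.4},
    {"alcohol": 12.4, "malic_acid": 2.5, "color_intensity": 3.1},
]
predictions = list(score_stream(new_rows, placeholder_scorer, FEATURE_COLUMNS))
```

Keeping the scorer behind a plain callable means the application code stays the same when the placeholder is swapped for the real scoring call from the deployed package.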

Does Driverless AI replace Data Scientists?

Driverless AI makes data scientists super productive and helps them automate the end-to-end process, just like any other automated tool that makes mundane, repetitive tasks exciting and efficient. Data scientists can set up, monitor and influence the model building, and override default decisions to accelerate model building or increase accuracy further. Driverless AI can also help data scientists build models across multiple applications simultaneously and simplify the model deployment/lifecycle chore. Think about how much time a business can save solving various complex ML/AI problems and running them in production, all without losing explainability.

Some links:

H2O’s Driverless AI website

Download a 21-day trial from here



About the Author

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition.

Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. He has published 50+ blogs on “all things data science” on the LinkedIn, Forbes and Medium publishing platforms over the years for a business audience, and he speaks at vendor data science conferences. He also holds multiple patents around desktop virtualization and ad networks, and was a co-founding member of two startups in Silicon Valley.
