October 16th, 2018
How This AI Tool Breathes New Life Into Data Science
Category: Beginners, Data Journalism, Data Science, Deep Learning, Driverless, Driverless AI, Explainable AI, GPU, Machine Learning, NLP, Python, R, Technical
By: Karthik Guruswamy
Ask any data scientist in your workplace: any supervised learning ML/AI project goes through many steps and iterations before it can be put into production. It starts with the question “Are we solving a regression or a classification problem?” and continues through:
- Data Collection & Curation
- Are there Outliers? What is the Distribution? What do we do with what we have?
- Feature Engineering & Class imbalance adjustments – Encoding/Munging, etc. How to handle text, numeric, categorical, binary data types?
- Careful Separation of Training, Validation and Test data sets
- Algorithm and Model Selection, Tuning and Cross-Validation
- Run it on a cloud or on-prem platform where algorithms (not just TensorFlow) run in parallel on GPUs (100x faster). How about using the latest Gradient Boosting and XGBoost algorithms, parallelized on GPUs?
- Are we avoiding Overfitting?
- Bonus points – Run multiple ensembles built from the best-performing algorithms and their parameters, and choose the winning ensemble.
- Build a scoring pipeline that can be used right away in production with just some basic hooks
- Rinse and repeat the above every time you get new data or when models decay.
- Explain to business how every model is doing, what it is doing – ALL THE TIME.
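To make one of the steps above concrete, here is a minimal sketch of the "careful separation" step on an imbalanced label column. It uses only the Python standard library. The function name `stratified_split`, the split fractions and the toy labels are illustrative assumptions, not something taken from any particular tool.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.2, test_frac=0.2, seed=0):
    """Return (train, val, test) index lists that keep class ratios similar.

    Hypothetical helper for illustration: rows are split per class so that
    a rare class is not accidentally left out of validation or test.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return sorted(train), sorted(val), sorted(test)


# Toy, imbalanced labels (made-up data): 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, val_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(val_idx), len(test_idx))
print("positives in test:", sum(labels[i] for i in test_idx))
```

Splitting per class rather than over the whole table is one simple way to keep the minority class represented in every partition; time-ordered data would need a different scheme.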
Every step above has a myriad of challenges. Today, all of the above are very painful to do, even for the best data scientists on a team. If your data has 200 columns or features that are a mix of numeric, categorical and text data, the problem becomes exponentially harder. Even savvy data scientists avoid running exhaustive tests, procedures and scientific methods; they try to stick to the basics or decide to automate only one or two of the steps above. The most automation you get today comes from a few cloud/SaaS vendors – they let you choose an algorithm along with your base features, and then run repetitive hyper-parameter tuning to get you the best model that you can use in production.
In reality, doing the above right takes a lot more than repetitive model tuning or using a single algorithm. We are not even talking about a big data problem – even with just 1 million rows, you could spend days, weeks or even a month to get an AUC of 0.94 or a really low log loss with no overfitting, only to find that your next batch of data has you chasing the problem all over again.
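For readers less familiar with the two metrics mentioned here, the following is a small sketch of how AUC and log loss can be computed for a binary classifier, using only the standard library. The function names and the four-row toy data are assumptions made for illustration. The pairwise AUC loop is quadratic in the number of rows, so it is meant to show the definition rather than to be used on a million rows.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This is the pairwise definition of AUC.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions (made-up data): one positive is ranked below one negative.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print("AUC:", roc_auc(y_true, y_prob))  # 3 of 4 positive/negative pairs are ordered correctly -> 0.75
print("log loss:", round(log_loss(y_true, y_prob), 4))
```

AUC only looks at how the rows are ranked, while log loss also penalizes confident but wrong probabilities, which is why the two scorers can favor different models.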
What if you had a tool that solves the above end-to-end problem using AI?
Driverless AI from H2O (AI to do AI)
Load your ground truth CSV file or point to your data source and push a button. Under the hood, the tool handles all the steps, such as feature engineering, feature evolution and running multiple algorithms on GPUs, and then outputs the code to put in production, all with an in-depth explainability report! How about getting to the leaderboard in a few minutes or an hour or so, using a tool with Kaggle Grandmaster smarts?
The screenshot above is from a run where I was building a classification model with the “Wine Data Set” from the UCI ML repository. Within 15 seconds of starting the classification, the first XGBoost model alone gave me an AUC of 0.9267, with the variable importance shown on the screen. As Driverless AI evolves the features and runs multiple algorithms (on 8 GPUs in this instance), you can watch the AUC continuously improve as it tunes LightGBM, XGBoost, TensorFlow, GLM, etc., with hyper-parameter optimization and feature evolution.
The next time I ran it, I used LOGLOSS as my scorer, and it finished with 0.7876 based on my initial settings. The evolution of both models and features is shown in the graph below:
Driverless AI can run on Docker on your PC/Mac or on EC2, Azure or GCP instances in the cloud (with multiple GPUs!). After you load your data and finish the experiment, you can deploy a Java or Python scoring pipeline package that contains the entire workflow. All you have to do is call the hooks from your production application with your new data stream to get results – whether it's a binary outcome, a numeric estimate/forecast or a multi-class decision.
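As a rough picture of what "calling the hooks" from an application could look like, here is a hedged sketch. Everything in it is hypothetical: `Scorer`, `StubScorer`, `score_stream`, the `alcohol` column and the threshold are illustrative names and values, and this is not the actual interface of the exported Driverless AI scoring package. The stub stands in for whatever object the real package provides.

```python
from typing import Dict, Iterable, Iterator, List, Protocol


class Scorer(Protocol):
    """Hypothetical interface: one row in, class probabilities out."""

    def score(self, row: Dict[str, object]) -> List[float]:
        ...


class StubScorer:
    """Stand-in for an exported scoring pipeline (assumption, not a real API)."""

    def score(self, row: Dict[str, object]) -> List[float]:
        # Made-up rule so the example runs without a trained model.
        p_positive = 0.9 if float(row["alcohol"]) > 12.0 else 0.2
        return [1.0 - p_positive, p_positive]


def score_stream(
    scorer: Scorer,
    rows: Iterable[Dict[str, object]],
    threshold: float = 0.5,
) -> Iterator[Dict[str, object]]:
    """Score each incoming row and attach a binary decision."""
    for row in rows:
        probabilities = scorer.score(row)
        yield {
            "probabilities": probabilities,
            "label": int(probabilities[1] >= threshold),
        }


# New, unlabeled rows arriving in production (made-up values).
new_rows = [{"alcohol": 13.1}, {"alcohol": 10.4}]
results = list(score_stream(StubScorer(), new_rows))

for result in results:
    print(result)
```

Keeping the application code behind a small interface like this means a retrained pipeline can be swapped in without touching the code that consumes the results.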
Does Driverless AI replace Data Scientists?
Driverless AI makes data scientists super productive and helps them automate the end-to-end process – just like any other automated tool that makes mundane, repetitive tasks exciting and efficient. Data scientists have options to set up, monitor and influence the model building, and to override default decisions to accelerate model building or increase accuracy further. Driverless AI can also help data scientists build models across multiple applications simultaneously and simplify the model deployment/lifecycle chore. Just think about how much time a business can save solving various complex ML/AI problems and running them in production – all without the loss of explainability.