By: Avni Wadhwa
In today’s market, there aren’t enough data scientists to satisfy the growing demand for people in the field. With many companies moving towards automating processes across their businesses (everything from HR to Marketing), companies are forced to compete for the best data science talent to meet their needs. A report by McKinsey says that based on 2018 job market predictions: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” H2O’s Driverless AI addresses this gap by democratizing data science and making it accessible to non-experts, while simultaneously increasing the efficiency of expert data scientists. Its point-and-click UI minimizes the complicated legwork that precedes the actual model build.
Driverless AI is designed to take a raw dataset and run it through a proprietary algorithm that automates the data exploration and feature engineering process, which typically takes ~80% of a data scientist’s time. It then auto-tunes model parameters and provides the user with the model that yields the best results. As a result, experienced data scientists spend far less time engineering new features and can focus on drawing actionable insights from the models Driverless AI builds. Lastly, the user can see visualizations generated by the Machine Learning Interpretability (MLI) component of Driverless AI, which clarify the model results and the effect of changing variables’ values. MLI reduces the black-box nature of machine learning models by showing clearly how a model arrives at its results and how changing feature values will alter them.
Driverless AI is also GPU-enabled, which can result in speedups of up to 40x. We demonstrated GPU acceleration achieving those speedups for machine learning algorithms at GTC in May 2017. We’ve ported XGBoost, GLM, K-Means and other algorithms to GPUs to achieve significant performance gains. This enables Driverless AI to run thousands of iterations to find the most accurate feature transforms and models.
The automatic nature of Driverless AI leads to increased accuracy. AutoDL engineers new features mechanically, and AutoML finds the right algorithms and tunes them to create the perfect ensemble of models. You can think of it as a Kaggle Grandmaster in a box. To demonstrate the power of Driverless AI, we entered a number of Kaggle contests; the results are shown below. Out of the box, Driverless AI performed nearly as well as the best Kaggle Grandmasters.
Let’s look at an example: we are going to work with a credit card dataset and predict whether or not a person is going to default on their payment next month, based on a set of variables related to their payment history. After simply choosing the target variable we want to predict and the number of iterations we’d like to run, we launch our experiment.
As the experiment cycles through iterations, it creates a variable importance chart ranking existing and newly created features by their effect on the model’s accuracy.
In this example, AutoDL creates a feature that represents the cross-validated target encoding of the variables sex and education. In other words, we group everyone in the dataset who is of the same sex and has the same level of education, and encode each group by its average default rate, computed on held-out folds so that a row’s own outcome does not leak into its feature. The resulting feature helps predict whether or not the customer is going to default on their payment next month. Generating features like this one usually takes the majority of a data scientist’s time, but Driverless AI automates this process for the user.
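To make the idea concrete, here is a minimal sketch of cross-validated (out-of-fold) target encoding in plain Python. It is not Driverless AI's proprietary implementation. The function name `cv_target_encode`, the column names, the toy rows, and the simple index-based fold split are all assumptions made for illustration.

```python
from collections import defaultdict

def cv_target_encode(rows, keys, target, n_folds=2):
    """Out-of-fold mean target encoding for the group defined by `keys`.

    Each row is encoded with the mean target of its group, computed only
    from rows in the *other* folds, so a row never sees its own label.
    Illustrative sketch; fold assignment is simply index modulo n_folds.
    """
    encoded = [0.0] * len(rows)
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_sum, train_n = 0.0, 0
        # Accumulate group statistics from every row outside this fold.
        for i, row in enumerate(rows):
            if i % n_folds == fold:
                continue
            group = tuple(row[k] for k in keys)
            sums[group] += row[target]
            counts[group] += 1
            train_sum += row[target]
            train_n += 1
        prior = train_sum / train_n  # fallback for groups unseen in training folds
        # Encode the held-out rows using those statistics.
        for i, row in enumerate(rows):
            if i % n_folds != fold:
                continue
            group = tuple(row[k] for k in keys)
            if counts.get(group, 0) > 0:
                encoded[i] = sums[group] / counts[group]
            else:
                encoded[i] = prior
    return encoded

# Hypothetical toy data: 1 means the customer defaulted next month.
rows = [
    {"sex": "F", "education": "grad", "default": 1},
    {"sex": "F", "education": "grad", "default": 0},
    {"sex": "F", "education": "grad", "default": 1},
    {"sex": "M", "education": "hs", "default": 0},
    {"sex": "M", "education": "hs", "default": 1},
    {"sex": "M", "education": "grad", "default": 1},
]
encoded = cv_target_encode(rows, keys=("sex", "education"), target="default")
for row, value in zip(rows, encoded):
    print(row["sex"], row["education"], value)
```

With only six rows the encoded values are noisy. On a real dataset one would typically use more folds and some smoothing toward the overall default rate, which this sketch leaves out.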
After AutoDL generates new features, we run the updated dataset through AutoML. At this point, Driverless AI builds a series of models using various algorithms and delivers a leaderboard ranking the success of each model. The user can then inspect and choose the model that best fits their needs.
Lastly, we can use the Machine Learning Interpretability feature to get clear and concise explanations of our model results. Four dynamic graphs are generated automatically: K-LIME, Variable Importance, Decision Tree Chart, and Partial Dependence Plot. Each one helps the user explore the model output more closely. K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate GLMs on samples formed from K-Means clusters in the training data. All penalized GLM surrogates are trained to model the predictions of the Driverless AI model. Variable Importance measures the effect that each variable has on the model’s predictions, while the Partial Dependence Plot shows the effect of changing one variable on the outcome. The Decision Tree Chart shows a surrogate model: an approximate flow chart of the complex Driverless AI model’s decision-making process, which also displays the most important variables and the most important interactions in the Driverless AI model. Finally, the Explanations button gives the user a plain English sentence about how each variable affects the model.
All of these graphs can be used to visualize and debug the Driverless AI model by comparing the displayed decision process, important variables, and important interactions to known standards, domain knowledge, and reasonable expectations.
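As a rough illustration of what a partial dependence plot computes, the sketch below forces one variable to each value on a grid for every row, then averages the model's predictions. This is the general technique, not Driverless AI's code. The model `toy_default_risk`, the column names `months_late` and `credit_limit`, and the data are hypothetical stand-ins.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is set to each value in `grid`.

    `predict` is any callable that maps a row (dict) to a number.
    Returns a list of (grid_value, mean_prediction) pairs.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original data is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def toy_default_risk(row):
    """Hypothetical stand-in for a trained model: returns a default probability."""
    risk = 0.1 + 0.2 * row["months_late"]
    if row["credit_limit"] < 50000:
        risk += 0.1
    return min(1.0, risk)

# Hypothetical customers; only the structure matters here.
customers = [
    {"months_late": 0, "credit_limit": 20000},
    {"months_late": 3, "credit_limit": 100000},
]
curve = partial_dependence(toy_default_risk, customers, "months_late", [0, 1, 2])
for value, mean_risk in curve:
    print(value, round(mean_risk, 2))
```

Plotting the grid values against the averaged predictions gives the partial dependence curve. A rising curve, as in this toy case, suggests the model associates more months of late payment with higher default risk.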
Driverless AI streamlines the machine learning workflow for inexperienced and expert users alike. For more information, click here.