Read the Full Transcript
Patrick Moran: Hello, and welcome everybody. Thank you for joining us today. My name is Patrick Moran. I’m on the marketing team here at H2O.ai. I’d love to start off by introducing our speakers.
Vinod Iyengar comes with over seven years of marketing and data science experience in multiple startups. He brings a strong analytical side and a metrics-driven approach to marketing. When he’s not busy hacking, Vinod loves painting and reading. He’s a huge foodie and will eat anything that doesn’t crawl, swim, or move.
Bojan Tunguz was born in Sarajevo, Bosnia and Herzegovina. Having fled to Croatia during the war, Bojan came to the US as a high school exchange student to realize his dream of studying physics. A few years ago, he stumbled upon the wonderful world of data science and machine learning and feels like he discovered a second vocation in life. Some of you may know Bojan through his Kaggle competitions and his Double Kaggle Grandmaster title.
Before I hand it over to Bojan and Vinod, I’d like to go over a few webinar logistics. Please feel free to send us your questions throughout the session via the questions tab, and we’ll be happy to answer them towards the end of the webinar. And secondly, this webinar is being recorded. A copy of the webinar recording and slide deck will be available after the presentation is over. Now, without further ado, I’d like to hand it over to Vinod.
Vinod Iyengar: Thanks, Patrick. Thank you everyone for joining us today for this really fun discussion. I’m really excited to do this webinar with Bojan, who is one of our amazing Kaggle grandmasters. And as the title slide points out, he’s a Double Kaggle Grandmaster, which is a very, very unique thing. I’ll have Bojan explain what that means in a bit.
But before we get started, here’s a quick intro to who we are. H2O.ai is an open source machine learning company; we’ve been in business for about seven years now. We have a very large data science community. Nearly 200,000 data scientists and close to 15,000 companies are using us on a regular basis. Close to half of the Fortune 500 companies are also using H2O on a regular basis. We have a huge meetup community, with over one hundred thousand Meetup members meeting regularly in different cities around the world. I think pretty much every week, there is an H2O Meetup in some part of the world. So if you’re interested, do feel free to join the community and learn more about data science.
From a product perspective, these are the products that most of the community knows. On the left, you have our open source products. H2O is our core, and Sparkling Water is our open source engine that runs on top of Apache Spark. It’s very popular, and probably the best machine learning on Spark, as we like to think of it, and our customers validate that. Nearly a third of our open source community uses us through Sparkling Water. And then we ported some of these algorithms to be accelerated on GPUs. We created a product called H2O4GPU that gives you algorithms like Gradient Boosting Machines (GBMs), Generalized Linear Models (GLMs), and K-Means Clustering. These are fully integrated with the latest GPUs so that they can take advantage of the latest and greatest hardware.
And finally, Driverless AI is our commercial automatic machine learning platform. It’s the fastest-growing platform in the space right now. We automate the entire machine learning workflow from data ingest all the way to production. We’ll talk a little more about that later in the session. It’s extremely popular with data scientists. It’s built by the Kaggle grandmasters, including Bojan, and it does some really cool stuff.
Let’s jump into today’s topic, right? So why AutoML? This is a quote from Gartner from one of their reports; and this is no news to anyone who’s in the industry. There’s a deep shortage of data scientists and ML experts, and it’s not likely to improve in the short to medium term, because in the colleges where folks are coming in, they are only beginning to adapt now to the latest techniques.
Another challenge is that the space is evolving so fast. A technique or a set of frameworks that is popular today may not be popular a year or two down the line. So you need to constantly adapt, and that makes it really challenging to create a pool of experienced data scientists who can keep coming in. The goal is to use AI to build models to help increase the productivity of employees in different enterprises.
The challenge, of course, is that when you show something like AutoML to data scientists who are busy coding and cracking away, they’re too busy to try out something new. That’s why you want to first spend a little bit of time understanding what AutoML is, and what the state of the art is in this space. Bojan has some really good stuff on setting up a framework for looking at where AutoML is as a space, and then looking at what the top considerations are, whether you’re an enterprise or a data scientist. If you want to pick an AutoML platform for your company, what should you be looking at? With that, I’m going to hand it over to Bojan to let him take control and talk about the data science workflow and the key considerations for picking an AutoML platform.
Bojan Tunguz: Good afternoon or good morning everyone, depending on which time zone you are in. As they mentioned, my name is Bojan Tunguz. I’m a Kaggle Grandmaster and a senior data scientist at H2O.ai. This presentation is adapted from a presentation I gave a few months ago at Kaggle Days. I want to take more of a bird’s eye view of what machine learning is, what data science is, and how we can automate machine learning. Then we’ll take it from there and help you understand what different degrees of automation in machine learning may mean.
Here we have just a general kind of data science workflow. It goes from formulating the problem and acquiring data to data processing, modeling, deployment, and then monitoring. Many of these stages are actually included in our Driverless AI tools. But for this bird’s eye presentation, I will just concentrate on the middle part of this, the modeling part.
So the modeling is the part where you actually already have all the data in more or less the shape that you want it to be. You just have to create the best, most effective models that you can. Now, what the best model is will depend on different situations and different domains. Many times, it means just getting the most accurate model, but in many other situations, it means robustness or some other thing that needs to be optimized. But to actually go beyond that, I want to start and ask, why would anyone want to have AutoML? Now there are many reasons, as Vinod has mentioned. There’s increased demand for data scientists and machine learning applications – that demand is not always being met.
There’s a relative shortage of people with relevant skills, and the number of positions is far outstripping the number of degrees or any kind of certificates that are being offered in the field. Sometimes you just want to try ML on some simple use case before committing to actually having a data scientist; yet you want to have ML that’s as good, or at least close to being as good, as something that a data scientist would produce. So you want to test the waters before you dive in.
Then various non-machine-learning practitioners – analysts, marketers, IT staff – want to have some part of their workflow that includes machine learning, but they don’t necessarily need to have a full-time data scientist. So a tool that would do most of the things that a data scientist could do with machine learning models would be useful for them. And then if the tool is good enough and pretty much does everything that you really need it to do, then you can save a lot of money. Instead of hiring a data scientist for $150,000 a year, you can get it through a tool that’s much cheaper than that, and then you can use it only as needed or only when you actually really need to get the most out of your investment.
If you have a tool that can actually do automation and machine learning, it allows you to iterate faster over development. Instead of having to code something and wait for the few days it takes to actually implement the code, you can just take the data, put it inside of a pipeline, and within a few hours, you can have an answer to your question, or see whether the data that you have can actually answer those questions.
And then this is one of my favorites. If you perform more and more different experiments, you are getting closer to actually really formulating the problem and approaching machine learning problems as a scientist, meaning you’re running experiments, you’re looking into the outcomes, and you’re making decisions on future iterations based on those outcomes. One of my mantras is “put science back in data science.” Another thing is that the number of people who are entering the data science field is increasing. This photo is actually my son. A few months ago, he picked up a book on neural networks. I don’t know how much he’s retained, but it’s a direction where we are headed with data science.
I was inspired by the six levels of autonomy in a driverless car. So there are different levels. We’re pretty much at a level three or four with driverless cars, depending who you ask. Fully autonomous cars would actually just kind of pick you up from spot A, and take you to spot B, without actually really needing to give any additional guidance. So, I want to come up with something similar for automated ML and come up with six different levels.
Actually, the first level will be level zero, which has no automation. You just code stuff from scratch, and probably use one of the relatively low-level programming languages like C++. One person who does that to this day is this Austrian grandmaster – Michael Jahrer – who I had the privilege of working with. His code is just breathtakingly detailed and sophisticated, but obviously, most people probably cannot implement something like that.
Level one would be just the use of some high-level algorithm APIs like Sklearn, Keras, Pandas, H2O, XGBoost, etc. That’s where most people who are participating on Kaggle these days are. We all rely on some of these tools. The reason that Kaggle has expanded and become so prevalent and popular over the last few years can be easily traced to the promotion of some of these tools. Some of these, like Keras and XGBoost, were actually specifically introduced for Kaggle competitions. Now they’ve become standard for a lot of machine learning workflows.
Level two is when you automatically tune hyperparameters, do some ensembling, and some basic model selection. There are several high-level packages like Sklearn that help you optimize hyperparameters. There is Bayesian optimization, which is a very popular package – it’s the one that is highlighted the most; Hyperopt is another one. So there are several different strategies, and many of them are getting automated nowadays for tuning hyperparameters for some of these algorithms. Ensembling is sort of the gold standard these days for making the best and most predictive models. We all know that no single model can really outperform ensembling, and different models each have their own strengths and peculiarities.
So instead, some of the level two automatic machine learning tools can do this ensembling by themselves. H2O.ai is one of them, for example. It can build several different models – XGBoost, linear models, and a few others – and then ensemble them into a very strong predictive model.
Level three is more or less where we are at right now – it’s the industry standard – or maybe a little bit of a level four. This is where automatic technical feature engineering comes into place. And by that, I mean feature engineering that can be done just using features without fully understanding the domain where these features are coming from. That would be like label encoding for categorical features, target encoding in some cases, binning of different features, and things like that. So these are things that don’t fully depend, by and large, on expertise in a particular domain – they can be automated. And that’s where we are right now.
Another one is the introduction of the graphical user interface, which I think really liberates machine learning from being just a tool kit for software engineers and data scientists. It’s becoming much more accessible to a wide spectrum of people who want to use it for their daily work.
Now, if you have some kind of very specific domain, there is certain feature engineering that would only make sense there. So, for instance, in credit risk, loan per income would be a very good feature. Otherwise, you may not be able to figure it out if you’re just looking at anonymous features. So this would be an example of some domain-specific feature engineering. Data augmentation here is likewise domain-specific rather than generic. An interesting example of that was when recently I came across a presentation where, in order to classify images from different eras, artifacts would be added to those images that would only be relevant to that era. So if you have a 1950s image and you add a radio from that era, that could be okay. But if you add an iPhone to that, that would be a really bad idea. So that is an example of domain-specific data augmentation.
For level five, there would be full machine learning automation. It’s the ability to come up with superhuman strategies for solving hard machine learning problems without any input or guidance, and then possibly having fully conversational interaction with the human users. So instead of talking to a data scientist, you could talk to an automated machine learning tool and come up with a strategy of how to best formulate a problem and what kind of a model to create. Now, we are still pretty far away from thinking about this stuff. Many people have told me that sounds more like science fiction than reality. So the big question is: is full AutoML even possible?
According to the no free lunch theorem, no single approach is best for every machine learning problem. It’s impossible to come up with an algorithm that will outperform all the others. But “real-world” problems are very specialized and form a very small, finite set of domains. For real-world problems, we do have people who have expertise in different fields who can come up with strategies, and we can learn from those. I came up with a “Kaggle Optimal Solution,” which is the best solution that could be obtained through a Kaggle competition, provided there are no leaks, special circumstances, or other exogenous limitations.
So Kaggle is proven to outperform, a lot of times, even the best domain experts in a particular field. This would be some kind of superhuman possibility. So if you have enough people working on a problem for extended periods of time, who are familiar with the machine learning tools, they can come up with optimal solutions that in many ways, no single human could possibly come up with. Now we know that these solutions do exist, because there are Kaggle competitions. If we could capture this, that would be something that a fully automated machine learning environment could do.
Superhuman AutoML would beat the best Kagglers almost every time. And that would be something that is still far ahead, and it’s not really clear how we can get to that point. So again, I’ll just briefly go over some of these levels. With no automation, you write machine learning algorithms from scratch; it requires a very high level of software engineering, and it’s not easy to do for actual practitioners of data science. In the old days, most of the people who were doing machine learning would actually be writing the tools from scratch, and this was obviously not an optimal use of their time. And it’s very, very hard to scale. This is my very crude depiction of what making those tools looked like back in the old days when we were doing it from scratch.
There are times where you do want to do something from scratch, namely when you really want to understand some of these algorithms. There are some good resources out there, including this book that I highly recommend where you do some of these algorithms from scratch. If you’ve been a data scientist for a few years and have some familiarity with all of these algorithms, it would behoove you to take a look at this book and really try to understand and then try to implement some of them from scratch. Obviously, you can’t do some of these more complicated neural networks from scratch – that would be impossible. But some of the simpler algorithms would definitely be worth your time from an educational perspective.
We live in a world where there’s a plethora of different APIs to use for building machine learning algorithms and data science pipelines. If you’re into that stuff, this is really a great place to be. But several times a day I hear about a new, great tool that is being implemented that really does part of your data science workflow – and it’s very hard to keep track of all of these things. So it’s great from the standpoint that it can do a lot of things, but it’s still very hard to keep track of all the tools that are out there.
So high-level APIs, as I mentioned – things like sklearn, XGBoost, Keras, H2O – allow novices who have some coding experience to go from building very simple models to being very proficient in a short amount of time. It’s pretty standardized. For instance, the sklearn API is becoming sort of the default these days. And here, for instance, is a linear regression imported from sklearn’s linear_model module. If you remember the slide from a few slides back where it was implemented in C++, you can immediately see why having these APIs would really make life so much easier for most practicing data scientists.
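For readers without the slide, here is a minimal sketch of what a level one workflow with the sklearn API can look like. It is not the code shown in the webinar; the data is synthetic and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration):
# y = 3*x0 - 2*x1 + 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

# The whole "algorithm" is two lines: construct the estimator, then fit it.
model = LinearRegression()
model.fit(X, y)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

The same construct, fit, predict pattern applies to most other sklearn estimators, which is much of why the API has become a de facto standard.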
For level two, you have automatic tuning. You could consider it the first real AutoML. It’s where you start taking several different models – you take a data set, a specified target, and let it create the best algorithms out of some subset of algorithms that you can think of. It selects a validation strategy, such as a cross-validation vs. a time validation split. Now in most cases, this automatic cross validation works, but the really hard cases are the ones where there are some peculiarities of the data, and a simple out-of-time validation or cross-validation can actually really burn you. This is one of those things where you really need to know that your data is such that some of these cross validation strategies can work out of the box.
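To make the difference between the two validation strategies concrete, here is a small sketch using sklearn’s KFold and TimeSeriesSplit. The eight-row data set is a placeholder, and the rows are assumed to be in time order.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Eight rows, assumed to be sorted by time (placeholder data).
X = np.arange(8).reshape(-1, 1)

# Shuffled K-fold: validation rows can come from anywhere in time,
# so a model may be trained on the "future" and validated on the "past".
kfold_splits = list(KFold(n_splits=4, shuffle=True, random_state=0).split(X))

# Time-based split: every validation row comes after every training row.
time_splits = list(TimeSeriesSplit(n_splits=3).split(X))

for train_idx, valid_idx in time_splits:
    print("train:", train_idx, "validate:", valid_idx)
```

If the data has a time component, the shuffled split can leak future information into training, which is one of the ways an out-of-the-box validation scheme can burn you.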
In level 2, it optimizes hyperparameters. It chooses the best learning rate, for instance, and the number of trees and subsampling of your data set. These are all hyperparameters that many of these APIs let you pick. But you know, it’s very hard to understand which ones are optimal for any given problem. And then it performs basic ensembling. For instance, if you have two algorithms and they give you predictions, you can take the average of those two predictions, and that’s very simple ensembling. But if one is performing much better than the other one, but the other one is not completely useless, then it can be tricky to find the right weights. Some of these automatic tools do that for you.
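As a rough illustration of finding the right blend between a stronger and a weaker model, the sketch below scans weights for a weighted average of two prediction vectors. The predictions are simulated (true values plus noise), which is an assumption made purely for illustration; in practice the weight would be chosen on held-out predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=500)

# Simulated predictions: model A is more accurate than model B,
# but B's errors are independent, so B is not useless.
pred_a = y_true + rng.normal(scale=0.3, size=500)
pred_b = y_true + rng.normal(scale=0.6, size=500)


def mse(pred):
    """Mean squared error against the true values."""
    return float(np.mean((y_true - pred) ** 2))


# Try weights 0.00, 0.01, ..., 1.00 for model A (model B gets the rest).
weights = np.linspace(0.0, 1.0, 101)
scores = [mse(w * pred_a + (1.0 - w) * pred_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])
best_mse = min(scores)

print("best weight for model A:", best_w)
print("blend MSE:", best_mse, "A alone:", mse(pred_a), "B alone:", mse(pred_b))
```

A plain 50/50 average corresponds to a weight of 0.5; the scan typically lands closer to the stronger model while still giving the weaker one some weight.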
In hyperparameter optimization, level two, there are several approaches. There is grid search, where you use a well-defined grid of values in some parameterized space and then try every one of them; that’s very computationally expensive. Then there’s random search, where you would use a subset of those hyperparameters. And this is the comparison between the two. Then there’s Bayesian search, which uses Bayes’ Theorem to do something very smart about where to look for the next potential hyperparameters, given the ones that you already looked at. Then there’s the use of Gaussian processes to look for different potential hyperparameters.
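Grid search and random search can be sketched with sklearn’s GridSearchCV and RandomizedSearchCV. The model, the grid values, and the synthetic data below are illustrative assumptions, not settings recommended in the talk; Bayesian search needs a separate package and is not shown.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic regression problem (placeholder data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Learning rate, number of trees, and subsampling, as mentioned in the talk.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [20, 50, 100],
    "subsample": [0.7, 1.0],
}

# Grid search: evaluates all 3 * 3 * 2 = 18 combinations.
grid = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
grid.fit(X, y)

# Random search: evaluates only 5 randomly chosen combinations.
rand = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    n_iter=5,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid search best:", grid.best_params_, grid.best_score_)
print("random search best:", rand.best_params_, rand.best_score_)
```

Random search does a fraction of the work here and often finds a configuration close to the grid’s best, which is why it is a common first choice when the grid is large.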
Many of the level one algorithms are also ensembles, like Random Forest or XGBoost. But for all practical purposes, as practicing data scientists, we treat them as fundamental algorithms that we want to ensemble with some other ones. We want to take a look, for instance, at blending, which is finding a weighted average of weak models. We have boosting, which is iteratively improved blending, and there’s stacking, where you create K-Fold predictions of base models, and use those predictions as meta features for another model. So these are some of the basic ensembling approaches, and most of the level two algorithms and level two AutoML solutions can do a pretty good job with these.
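Stacking as described here, K-Fold predictions of base models used as meta features for another model, can be sketched as follows. The choice of base models, the meta model, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Placeholder data.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

base_models = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Out-of-fold predictions: each row is predicted by a model that
# never saw that row during training. One column per base model.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_models]
)

# The meta model learns how to combine the base models' predictions.
meta_model = LinearRegression()
meta_model.fit(meta_features, y)

print("meta feature matrix shape:", meta_features.shape)
print("learned combination weights:", meta_model.coef_)
```

To score new data, each base model would also be refit on the full training set, and its predictions on the new rows passed to the meta model.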
For instance, this is an example of a very complicated ensemble, where you ensemble things of several different levels for a final prediction. And this particular example is from my solution for distinguishing between cats and dogs. Now you would think that this is one of the simplest possible problems, but you can do some very fancy and very complicated ensembling, and some of the more advanced machine learning tools can do this for you.
In level three, which is where most of the good solutions on the market are right now, we have automatic (technical) feature engineering and feature selection. Feature engineering refers to the fact that you create new features, or you do something to existing features to make them yield more information. There are many ways of doing it.
I mentioned some of these – binning, feature interactions – and many of these are implemented in good AutoML solutions. Technical feature selection is a little bit harder to do, and it’s not very well done even by most experienced machine learning practitioners. Once you create many of these new features, many of them may not be optimal for your problem. So, you have some tool that automatically decides which of the features that you created would be good for the model. Technical data augmentation is where you can flip or rotate images, and do other things.
And finally, we have a graphical user interface, which makes non-technical people much more effective at creating good machine learning models that they can use for their own workflows. An analogy to this is word processing versus typesetting everything in LaTeX: it enables more people to write very good-looking and effective documents, even though they don’t have any particular low-level skills.
For automatic feature engineering, you can use different encodings for categorical data. You can use the different encodings of numerical data, as well as aggregations and feature interactions. These are all of the things that they can do. Word embedding is when you have textual data and you actually turn text into some kind of vector in some vector space. For images, you can have pre-trained neural networks that can actually turn images into an array that you can then use for other machine learning algorithms.
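A few of these technical encodings can be sketched with pandas. The tiny table, column names, and bin edges below are made up for illustration.

```python
import pandas as pd

# Toy data set (hypothetical columns and values).
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "A"],
    "income": [30, 55, 42, 90, 61, 38],
    "target": [0, 1, 0, 1, 1, 1],
})

# Label encoding: map each category to an integer code.
df["city_label"] = df["city"].astype("category").cat.codes

# Target encoding: replace each category with the mean target for it.
# In practice this is computed out-of-fold to avoid leaking the target.
df["city_target_enc"] = df.groupby("city")["target"].transform("mean")

# Binning: turn a numeric feature into ordered buckets.
df["income_bin"] = pd.cut(
    df["income"], bins=[0, 40, 60, 100], labels=["low", "mid", "high"]
)

print(df)
```

None of these steps needs any knowledge of what a city or an income means, which is what makes them candidates for automation.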
In terms of technical feature selection, we have things like selecting features based on the feature importance of some test model. You train the model to see which features are the most relevant. You have forward feature selection, where you select a feature, see how it performs, add another feature, and see if the model improves. If it doesn’t, throw it away. So you go one by one through all of the features until you have a set that works well with your given model.
The opposite is recursive feature elimination, where you start with all the features you have and then eliminate them one by one. All of these are very computationally intensive, and may not be optimal for most problems, especially since the number of features can easily go into the thousands or tens of thousands. There’s also permutation impact, where you just take one feature out at a time, and see how it works without that feature.
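The three selection strategies can be sketched with sklearn. The data is synthetic, with only the first two of six features carrying signal, and the choice of a linear model is an assumption for illustration. Note that sklearn’s permutation_importance shuffles a feature’s values rather than removing the feature, which measures a similar effect without retraining.

```python
import numpy as np
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 1 matter (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Forward selection: add one feature at a time, keeping the best addition.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=3
)
forward.fit(X, y)
forward_idx = forward.get_support(indices=True)

# Recursive feature elimination: start with everything, drop the weakest.
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
rfe_idx = rfe.get_support(indices=True)

# Permutation importance: shuffle one column at a time and measure the damage.
model = LinearRegression().fit(X, y)
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
top2 = np.argsort(perm.importances_mean)[::-1][:2]

print("forward selection kept:", forward_idx)
print("RFE kept:", rfe_idx)
print("top two by permutation importance:", sorted(top2.tolist()))
```

Each selector refits the model many times, which is the computational cost mentioned above; with thousands of candidate features this becomes expensive quickly.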
As I mentioned before, generating new features results in a combinatorial explosion of possible features to use, and we need some more sophisticated strategies to actually select features that would be useful for our model. And this is what some of the best AutoML tools that are on the market right now can do for you.
For instance, one of the approaches for this is genetic programming, where you “evolve” features, look at which ones survive, and create a new subset of features based on those evolutionary algorithms. And that’s something that they implement in H2O’s Driverless AI.
When it comes to technical augmentation, there are many different things that you can do. For instance, you can add stock prices to temporal data. If you want to look at how the economy’s performing or if some other financial indicator is performing well, you can look at stock prices over time to see if there’s some correlation between the stock market and default rates for some kind of loan.
You can add geographical information. This is something that’s very informative, but with this one, you have to be very careful that you don’t introduce some kind of regional biases in your models, which for regulatory purposes, can be a questionable practice. FICO scores can be another important additional piece of information if you’re running a loan business. FICO scores obviously would be a great piece of information to have.
Some of our teams at Kaggle competitions have discovered that when you have textual data, and you do some kind of automated translation of the text into another language and then translate it back in the original language – that introduces some noise. And the hope is that this noise would help you with ensemble of the model that you’re building. And again, this is very technically straightforward – there’s no human in the loop in this process.
Injecting noise is a tried and tested way of doing data augmentation. You can do various math transformations on sound and image data. Then there are image-specific transformations: blurring, brightening, color saturation, etc. There are different ways that technical data augmentation can be done. There are libraries out there that do it for you, but there are AutoML solutions that also do it in the background.
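A few of these technical augmentations can be sketched with plain NumPy. The random array below stands in for a grayscale image, and the noise scale and brightness factor are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a grayscale image: height 4, width 6, pixel values in [0, 1).
image = rng.uniform(0.0, 1.0, size=(4, 6))

# Geometric augmentations: mirror left-to-right, rotate by 90 degrees.
flipped = np.fliplr(image)
rotated = np.rot90(image)

# Noise injection: add small Gaussian noise, then clip back to the valid range.
noisy = np.clip(image + rng.normal(scale=0.05, size=image.shape), 0.0, 1.0)

# Brightening: scale pixel values up, then clip.
brighter = np.clip(image * 1.2, 0.0, 1.0)

print("original:", image.shape, "rotated:", rotated.shape)
```

Each augmented copy keeps the original label, so the training set grows without any new labeling effort.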
GUI is one of the things that I think every level three AutoML needs to have. It facilitates interaction with software, it allows for many non-technical people to use it, and it further facilitates iterations and development.
Level four is beyond where we are right now. It requires domain-specific feature engineering. It requires the ability to combine several different data sources into a single one suitable for ML exploration. I like going back to the loan business problem: if you have different tables that come from different aspects of the loan process, how you combine these tables into a single one that can be suitable for machine learning is not trivial. How do you aggregate data from a transaction history? None of these things are easy to do, and we don’t have an automated way of doing it. Some domain-specific cases may have it, but in general, we don’t have automated feature generation.
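Done by hand, the loan example might look like the pandas sketch below: aggregate a transaction history per customer, join it to the loan table, and add a domain feature such as loan per income. The tables, column names, and chosen aggregations are hypothetical; deciding which aggregations and joins make sense is exactly the part that still needs a human.

```python
import pandas as pd

# Hypothetical loan applications, one row per customer.
loans = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "loan_amount": [10000, 25000, 5000],
    "annual_income": [50000, 50000, 40000],
})

# Hypothetical transaction history, many rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [100.0, 300.0, 50.0, 150.0, 400.0],
})

# Aggregate the history down to one row per customer.
tx_features = (
    transactions.groupby("customer_id")["amount"]
    .agg(tx_count="count", tx_mean="mean", tx_max="max")
    .reset_index()
)

# Left join keeps customers with no transactions (customer 3 here).
table = loans.merge(tx_features, on="customer_id", how="left")
table["tx_count"] = table["tx_count"].fillna(0).astype(int)

# Domain-specific feature: loan amount relative to income.
table["loan_per_income"] = table["loan_amount"] / table["annual_income"]

print(table)
```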
You can do some advanced hyperparameter tuning, where you go beyond Bayesian optimization and some of those things that we mentioned before. We have domain and problem-specific feature engineering, where you do aggregations according to what makes sense for your problem. Adding a particular kind of noise to images that really makes sense for those kinds of images – these things require a lot of human interaction and human understanding of the problem.
Then we have the ability to combine several different data sources, such as joining tables, and understanding which merges make sense and executing them. All of these things still require a lot of human interaction. Manual hyperparameter tuning is still my number one go-to approach for tuning hyperparameters. I have tried many of these packages, and by using my own intuition and experience with tuning some of these XGBoost models, for instance, I can still come up with better hyperparameters than any of the auto solutions out there. So there’s still a lot of intuition and human understanding involved that we still are not able to completely capture.
With advanced hyperparameter tuning, there’s a deeper understanding of the data. It may require some transfer learning. It means building hyperparameters based on a previous experience with different hyperparameters.
For domain-specific feature engineering, domain understanding will be crucial to create different features. The ability to get additional data based on the problem/domain, and to integrate it into the ML pipeline – that’s still where real-world data scientists come into play.
With level 5, we’re in full ML automation. This is a little bit beyond what even we are able to foresee right now. But this is what a fully automated machine learning solution would look like. It requires the ability to come up with super-human strategies for solving hard ML problems, without any input or guidance. It could have a fully conversational interaction with the human user. So this would be, essentially, having a Kaggle Grandmaster sitting in front of you and coming up with solutions to your problem.
Up to level 4, all of the automation is essentially “hard-coded.” There is still something that you have to come up with prior to deploying the solution, and then you have the solution run itself. But now with full automation, we need to use a machine learning approach to know how to build it. That means using machine learning to teach AutoML systems how to do machine learning.
Machine learning is an approach to building software and products that requires a lot of data. Now we would really require a lot of data and use cases on how to build machine learning pipelines, and then train the model on all of them to actually come up with a better, super-human approach. We might need some unsupervised approaches. This would be machine learning for machine learning for machine learning. The idea is, in principle, simple: give the ML system a large collection of ML problems and their solutions, and then let it “learn” how to build ML systems. It’s simple in principle, but the execution is hard because we have relatively few machine learning problems to work with. It’s very daunting; even the simplest ML problem requires thousands of instances to train on for decent performance.
However, we probably don’t need to build all this from scratch. We might be able to bootstrap on top of the previous level of automation. If you have all the previous levels and they’re working fine, then you can do something like reinforcement learning, or you can use unsupervised techniques. If we can parameterize our problems and parameterize possible solutions to them, then we can come up with the universe of human-relevant ML problems, and we might be able to find some patterns in the data itself. Unsupervised methods are much better suited for situations where you don’t have too much data, but you still want to understand something about it and come up with solutions.
Then there’s reinforcement learning: building ML solutions and, based on how well they perform, adjusting their architecture. So this would be an environment where machine learning tools can learn from the experience of trying to solve machine learning problems. Then there is adversarial AutoML: have AutoML systems compete against each other. Make a Kaggle competition that’s only open to AutoML systems and iterate. This is possibly something that could come in the future, where you have different AutoML systems compete with each other and learn from the experience.
Fully conversational interaction with a human user would be another thing that you would need from such a system. So again, this would be an AutoML system that doesn’t necessarily pass the full Turing test, but has enough of a domain-specific understanding of machine learning that it can pass Turing tests for machine learning problems and be able to interact with you like any other data scientist would. It could democratize machine learning and make it even more accessible than it is right now. Formulating a machine learning problem is a very interactive process where you interact with domain experts, other data scientists, and analysts to actually come up with something that can be really useful for everyone.
There are a few downsides to AutoML, but I’m not going to spend too much time on these. I’m just going to click through them, because I want to hand it over to Vinod, who will introduce you a little bit more to our Driverless AI system. All right.
Vinod Iyengar: Thank you, Bojan. It is extremely useful to understand the state of the space. In a nutshell, to recap what Bojan said: I think of the levels almost like generations. Gen one is basically a lot of the open source frameworks, which do hyperparameter tuning, ensembling, and a leaderboard sort of approach. Then you have gen two, where we’re getting into advanced feature engineering and HPC-powered evolutionary model development to do a lot of the work. And then, going forward, gen three is basically getting to full AutoML, where AI is available at your fingertips. Now with that said, if you’re an organization or a data scientist looking to pick a platform to do AutoML, what should you be looking at today?
Based on the state of the industry, here are the top considerations in my mind. To begin with, think about how you can automate the entire workflow. As Bojan mentioned earlier, we talk about feature engineering and modeling, but upstream you also have data prep and data ingest, and downstream you have model deployment and monitoring. It’s about figuring out how your platform can automate as much as possible so that you can spend more time thinking about the problem framing and actually evaluating whether it’s doing the right thing.
Portability and flexibility: what that means is you don’t want to be locked into one vendor or one environment. So there are considerations around where the data is. Is it on cloud or on-prem? Where is the compute running? What about running it on different sorts of hardware, like GPUs or CPUs, or using the latest processors, for that matter? And then think about running them in different configurations as well.
Along with that comes the idea of extensibility. Until now, we’ve been thinking about automation, and as I mentioned, it’s eventually going to get to full automation. In the interim, can you extend it? Can you customize a platform or add your own sort of flavor to it? These could be new algorithms that the platform may not have today, but think about other ways to add those. So that’s an important consideration.
And finally, explainability, trust, and transparency. I think it is very important. With automation, it becomes even more critical. Essentially, if you think that algorithms are black boxes themselves, now you have a massive black box, which does AutoML. You give it data, and it gives you a model and a prediction.
If you can’t explain and validate the model, then that becomes a big challenge. Thinking about what tools are available to achieve that is a big consideration when you’re picking a platform. Then you also want future proofing. Part of the reason why you’re doing AutoML is that one person cannot learn everything or master everything. And that goes back to the no free lunch theorem. I have my own take on it, which is a no free lunch theorem for data scientists as well: no one data scientist is an expert in every single field. Even in Kaggle, we have some folks who are deep learning experts, and some folks who focus on feature engineering. So you want the tool or the platform to do all of it for you; thinking about that becomes critical.
Just to extend that a little bit further, if you look at the challenges in the ML model development workflow, there is feature engineering, model building, and model deployment at the high level, but even within those, there are subsections. You’re looking at things like simple encoding, advanced encoding, feature engineering, and feature generation, which is looking at interaction effects or transformations. Then within the model building part itself, you have algorithm selection, what framework to use, and the parameter tuning and ensembling that Bojan touched upon quite a bit. When it comes to deployment, you are looking at how you can generate a pipeline, easily apply it, monitor it, explain the results, and then document the whole workflow. All of these are time-consuming tasks because each of them requires a lot of work.
Look at tools that can automate the entire workflow. At H2O, for example, we have two AutoML offerings. One is the H2O open source AutoML, which is very widely used, and that is basically focused on the model building part. So it automates the model building part thoroughly. It does all the algorithm selection for you, the parameter tuning, and then the ensembling. There’s also some simple encoding, and it generates a pipeline, at least for the model building portion. Then, with H2O Driverless AI, we took the remaining pieces as well and automated them. That becomes useful for you to see when you’re picking a platform.
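To make the "parameter tuning, leaderboard, ensembling" loop concrete, here is a minimal sketch in plain Python. It is not the H2O AutoML API: the toy k-nearest-neighbors regressor and the function names (knn_predict, build_leaderboard, ensemble_predict) are assumptions made up for illustration. The idea is only to show candidates being scored on held-out data, ranked, and then averaged.

```python
def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points (1-D inputs)."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def validation_mse(train, valid, k):
    """Mean squared error of the k-NN model on a held-out validation set."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


def build_leaderboard(train, valid, candidate_ks):
    """Score every candidate hyperparameter and rank them, best (lowest error) first."""
    return sorted((validation_mse(train, valid, k), k) for k in candidate_ks)


def ensemble_predict(train, x, ks):
    """Simple ensemble: average the predictions of several tuned models."""
    return sum(knn_predict(train, x, k) for k in ks) / len(ks)


if __name__ == "__main__":
    # Hypothetical toy data: y = 2x, even x for training, odd x for validation.
    train = [(i, 2 * i) for i in range(0, 20, 2)]
    valid = [(i, 2 * i) for i in range(1, 20, 2)]
    board = build_leaderboard(train, valid, candidate_ks=[1, 2, 3, 10])
    for error, k in board:
        print(f"k={k:<3} validation MSE={error:.2f}")
    top_two = [k for _, k in board[:2]]
    print("ensemble prediction at x=5:", ensemble_predict(train, 5, top_two))
```

A real AutoML tool searches over many algorithms and far larger parameter spaces, and typically uses cross-validation and stacked ensembles rather than a plain average, but the shape of the loop is similar.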
Coming to the portability and flexibility question that I mentioned earlier: you really want to know if your platform can run on cloud, if it can run in a hybrid fashion, and if it can run on prem. For example, as your data size increases, you want the ability to run in a distributed fashion and to handle larger or more varied data sets. Those become critical. Data could also be coming from a whole lot of different places, right? So you cannot restrict yourself to one single data source. It could be coming from on prem or cloud. And then there are integrations, because there are a whole bunch of tools in the arena that you are probably going to need to use along with your AutoML platform.
Think about picking platforms which can integrate as much as possible with other tools. If you already have certain tools that are set up for data munging, data prep, and infrastructure management, the ability to integrate those with existing frameworks and platforms is critical. In the same breath, also think about all the different data sources where your data might be coming in from. It’s not just simple CSVs or TSVs; there are also big data formats, data frameworks, and data frames coming from different platforms. What you want is a breadth of connectors and ingestion sources.
What does this look like? These are some examples that we published ourselves with some of our partner sites. When we talk about flexibility, this is a cloud deployment that a lot of our customers do. The data is coming in from Snowflake and other data sources in the cloud, pulled in through Alteryx to do the data prep, and then into Driverless AI for the feature engineering and model building. All of this is running on either AWS or Azure or GCP. And this is just one reference architecture that’s very popular with our customers.
But very similar to that, we can do a hybrid one. Again, instead of running two data warehouses, you are running it on BlueData, which in turn connects to on-prem HDFS. Again, you would use some data prep tool in between, and then run the feature engineering and machine learning in Driverless AI. So the environments you run in are going to be very different.
If I flip this to a completely on-prem-based solution, now you’re looking at data integration from a whole bunch of data sources, like SQL data warehouses or HDFS or just file systems, and pulling that in to do the data quality transformation in something like Spark, for example. And then you run the feature engineering and model building in Driverless AI. You want the platform to be flexible across all these different types of deployments.
And as you’re considering that, think about these things. Data gravity becomes critical: as your data size gets larger and larger, you don’t really want to ship it across, so you need to find a platform that can run close to your data. It’s also a basic security consideration, in the sense that oftentimes some of this data might be private data, PII, or data covered by HIPAA, in which case you don’t want it sent to different places without a lot of careful vetting. So if everything can run inside your own secure firewall, then it’s perfect. So at least look at frameworks that can do that.
The space is rapidly evolving, too. You need to know what you don’t know, right? Having an awareness of what you’re not good at is important, so that you can have the platform find the best ideas for you. So look at the latest technologies, the latest techniques, new networks, and new architectures; can you be the first to market? Pick a platform that can give you that velocity of innovation, and implement that for your company.
Finally, you have hardware. There are a lot of improvements and progress happening on the hardware front. The latest CPUs and GPUs are much, much faster than even a couple of years ago. And AutoML especially is a very compute-intensive and resource-intensive operation. You do want to take advantage of the latest innovations there. Find platforms that can take advantage of the latest GPUs, for example.
This is just an example of how Driverless AI can run on completely different configurations. On one side, that could be something like Optane DC persistent memory from Intel, a configuration heavily loaded with persistent memory and the latest CPU infrastructure. Conversely, on the other side, with the latest DGX-1 and DGX-2, we can run on GPUs as well, with up to 16 GPUs scaled on a single machine, and that can give you some phenomenal performance too. Find platforms that can take advantage of the latest hardware; these are a couple of good examples.
Let’s talk about extensibility for a second. We think of it in three different ways. The first is, even if your platform does some automatic feature engineering, what about other things that you know as a domain expert? Can you bring those in? So custom feature engineering is important. That means doing your own transformations and your own domain-specific interactions, for example, and bringing them into the platform. Can the platform take advantage of them? Then there are custom ML algorithms. Obviously, there are a whole bunch of different frameworks out there which are really good at a bunch of different problems. But there’s new stuff coming out all the time. So you want to be able to try new stuff and see quickly if it makes sense for you. Having the ability to add custom ML algorithms is critical.
Finally, custom loss functions. This is the part which is going to be very tough to automate anyway, right? This is the part that you know very well, so you know what your customer’s lifetime value is. A highly valued customer is very different from a not-so-valuable customer, so you want your loss function to be optimized for that. You want to optimize for the business metrics that are important to you. It’s important to find a platform that allows you to do that. In Driverless AI, for example, we now have the ability to add custom transformers. Literally, you can bring in a simple pipeline recipe that can do this for you. Any data scientist can implement their own transformer or their own custom models; Driverless AI’s engine will use them, just as if they were native. That’s very important.
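As a rough illustration of a loss function tied to a business metric, the sketch below weights an ordinary binary log loss by a per-customer value, so that a bad prediction on a high-value customer costs more. This is a generic example, not a Driverless AI recipe; the function name value_weighted_log_loss and the customer_value input are assumptions made up for illustration.

```python
import math


def value_weighted_log_loss(y_true, y_prob, customer_value, eps=1e-15):
    """Binary log loss where each row is weighted by a business value.

    y_true: 0/1 labels; y_prob: predicted probability of class 1;
    customer_value: non-negative weight per row (hypothetical lifetime value).
    """
    total_weight = sum(customer_value)
    if total_weight <= 0:
        raise ValueError("customer_value must sum to a positive number")
    loss = 0.0
    for y, p, w in zip(y_true, y_prob, customer_value):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 or 1
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss / total_weight


if __name__ == "__main__":
    labels = [1, 1]
    probs = [0.9, 0.1]  # the second customer is badly mispredicted
    print(value_weighted_log_loss(labels, probs, [1, 1]))  # equal weights
    print(value_weighted_log_loss(labels, probs, [1, 9]))  # second customer is high value
```

With equal weights this reduces to the usual log loss; raising the weight on the mispredicted customer raises the loss, which is what pushes an optimizer toward getting the valuable customers right.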
In addition, look for platforms that have a whole community of open source recipes. Find a platform that has a pre-built community, because then you can reuse and repurpose. There’s a lot of collaboration that happens, and things get better. Obviously, no company can add an unlimited number of people. The community is obviously larger, especially if you can take advantage of the Kaggle community, which has some phenomenal scripts and recipes. See if you can bring them in on your platform.
Let’s take a few minutes to talk about trust, transparency, and explainability. They’re critical. Why do they matter? You’re letting the machine make all the choices: the parameters, the features, the tuning, even the algorithms. Everything is done by the machine. You essentially have a bigger black box now. You have to figure out the trade-offs between interpretability and performance. There are obviously a lot of techniques, such as using surrogate models, for getting approximate explanations. Those are important. Can your platform show you those as you’re making decisions?
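A surrogate model is a simple, readable model trained to imitate the predictions of the complex one. The sketch below fits the simplest possible surrogate, a single-split decision stump, to the outputs of a black box. It is only an illustration under stated assumptions: the function fit_stump_surrogate and the toy black box are made up here, and real tools usually fit deeper trees or linear models.

```python
def fit_stump_surrogate(rows, black_box, feature_names):
    """Fit a one-split stump to the black box's predictions (not to the true labels).

    Returns a dict describing the best split, or None if no feature varies.
    """
    preds = [black_box(r) for r in rows]
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [p for r, p in zip(rows, preds) if r[j] <= threshold]
            right = [p for r, p in zip(rows, preds) if r[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((p - left_mean) ** 2 for p in left)
                   + sum((p - right_mean) ** 2 for p in right))
            if best is None or sse < best["sse"]:
                best = {"feature": name, "threshold": threshold,
                        "left_mean": left_mean, "right_mean": right_mean,
                        "sse": sse}
    return best


if __name__ == "__main__":
    # Hypothetical black box: the score depends only on the first feature.
    def black_box(row):
        return 0.9 if row[0] >= 5 else 0.1

    rows = [(x, z) for x in range(10) for z in range(3)]
    rule = fit_stump_surrogate(rows, black_box, ["x", "z"])
    print(f"if {rule['feature']} <= {rule['threshold']}: "
          f"score ~ {rule['left_mean']:.2f}, else ~ {rule['right_mean']:.2f}")
```

The explanation is approximate by construction: the remaining squared error (sse here) tells you how much of the black box's behavior the simple rule fails to capture.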
Similarly, there are obviously multiple different models, right? Again, it goes back to the no free lunch theorem. If there are a lot of different models, which one do you pick? Using the same objective metrics, can you pick a model which is simpler to understand and more interpretable, but whose performance is still close enough to that of the best model? Fairness and social aspects are also becoming more and more critical. ML models are becoming very prevalent in things like credit scoring, or even healthcare use cases. It’s critical to check that the model doesn’t bring in human bias or discriminate on the basis of gender, age, ethnicity, et cetera. So how do you do that? You want techniques that can help identify and remediate disparate impact. Find tools that can help you do that and see if your platform supports them.
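One common way to screen for disparate impact is the adverse impact ratio: each group's rate of favorable outcomes divided by the rate for a reference group. The sketch below is a minimal version of that calculation; the function name and the toy data are assumptions for illustration, and the 0.8 cutoff in the usage is the widely cited "four-fifths" rule of thumb, not something stated in the talk.

```python
from collections import defaultdict


def adverse_impact_ratios(decisions, groups, reference_group):
    """Favorable-outcome rate of each group divided by the reference group's rate.

    decisions: truthy means a favorable outcome (for example, credit approved).
    groups: group label for each decision, in the same order.
    """
    favorable = defaultdict(int)
    total = defaultdict(int)
    for decision, group in zip(decisions, groups):
        total[group] += 1
        if decision:
            favorable[group] += 1
    if total[reference_group] == 0:
        raise ValueError("reference_group does not appear in groups")
    reference_rate = favorable[reference_group] / total[reference_group]
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return {g: (favorable[g] / total[g]) / reference_rate for g in total}


if __name__ == "__main__":
    # Hypothetical approvals for two groups of four applicants each.
    decisions = [1, 1, 1, 0, 1, 0, 0, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    for group, ratio in adverse_impact_ratios(decisions, groups, "a").items():
        flag = "review" if ratio < 0.8 else "ok"
        print(f"group {group}: ratio {ratio:.2f} ({flag})")
```

A low ratio is a signal to investigate, not a verdict; remediation and any legal interpretation depend on the use case and jurisdiction.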
Trust is, of course, critical. You want a whole bunch of model debugging tools. Think about ways to debug the models. There are obviously the classic techniques, like partial dependence plots (PDP). Going beyond that, find out what techniques are available to help you debug the models and understand how well they’re performing in the real world. It’s about getting that level of granularity down to each individual prediction. It’s important.
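Partial dependence is simple enough to compute by hand: force one feature to a fixed value in every row, average the model's predictions, and repeat across a grid of values. The sketch below shows that calculation for any model exposed as a Python callable; the function name and the toy linear model are assumptions for illustration.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average prediction when one feature is forced to each value in grid.

    model: callable taking a list of feature values and returning a number.
    Returns a list of (grid_value, average_prediction) pairs.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the input data is not changed
            modified[feature_index] = value
            total += model(modified)
        curve.append((value, total / len(rows)))
    return curve


if __name__ == "__main__":
    # Hypothetical model: prediction = 2 * feature0 + feature1.
    def model(row):
        return 2 * row[0] + row[1]

    data = [[0, 1], [0, 3]]
    for value, average in partial_dependence(model, data, 0, [0, 1, 2]):
        print(f"feature0={value}: average prediction {average}")
```

Plotting the resulting pairs gives the partial dependence plot. Because it averages over all rows, it can hide interactions, which is one reason to also look at individual predictions.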
Finally, the last two pieces are security and hacking. This is a very new topic, and I highly recommend watching the webinars by Patrick Hall for a full look at it. There’s a blog as well. Can you use these techniques to understand if your model is vulnerable in certain regions of the input space? Can your model be hacked by inputs that are designed to fool it? How do you check for that? Use different techniques to identify those things and solve for them: build adversarial models and adversarial data sets to find the weak spots in the model. Think about those considerations as well, especially when you’re using AutoML, because you may not fully know what went into it.
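A very simple way to probe for weak spots is to nudge one input feature in small steps and record the smallest change that flips the model's decision. The sketch below does exactly that. It is a toy sensitivity probe, not a full adversarial attack, and the function name, the step sizes, and the example model are all assumptions made up for illustration.

```python
def smallest_flip(model, row, feature_index, step, max_steps, threshold=0.5):
    """Return the smallest tested change to one feature that flips the decision.

    model: callable returning a score; the decision is score >= threshold.
    Tries +step, -step, +2*step, -2*step, ... and returns the first change
    that flips the decision, or None if nothing within max_steps does.
    """
    original = model(row) >= threshold
    for n in range(1, max_steps + 1):
        for direction in (1, -1):
            candidate = list(row)
            candidate[feature_index] = row[feature_index] + direction * n * step
            if (model(candidate) >= threshold) != original:
                return direction * n * step
    return None


if __name__ == "__main__":
    # Hypothetical scoring model with a sharp cliff at feature0 = 10.
    def model(row):
        return 0.9 if row[0] > 10 else 0.1

    applicant = [9.0, 0.0]
    print("feature 0:", smallest_flip(model, applicant, 0, step=0.5, max_steps=10))
    print("feature 1:", smallest_flip(model, applicant, 1, step=0.5, max_steps=10))
```

Rows where a tiny change flips the outcome are candidates for closer review, and collecting such rows is one way to start building an adversarial test set.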
Then, regulatory and control environments become very critical. This is something that we tackle all the time. The largest banks, healthcare companies, and insurance companies use our models in production. So there are legal requirements to be able to explain every single prediction. For example, if you deny credit to someone, you have to explain exactly why that happened. Similarly, if you are an insurance company and are denying a claim, you have to explain that as well. So fairness and bias reduction are important considerations as well. Think about all these things when you’re making a decision on what platform to buy, and ask your vendor these questions. We at H2O are very focused on these things. That’s why we’ve been talking about this. We truly believe that this is important for the space. Oftentimes, we are in front of customers who ask us this question, and we say the same thing.
We probably don’t have any time for Q & A, but we’ll try to get back with answers to those who posed questions. Thank you, Bojan, for jumping in last minute for this webinar. You gave a wonderful overview of the AutoML space. We hope that you found this webinar to be fruitful, productive, and informative. Thank you for joining us, and we’ll send the recording to you shortly.
Bojan Tunguz: Bojan was born in Sarajevo, Bosnia & Herzegovina, which his family fled for Croatia during the war. He came to the US as a high school exchange student and managed to realize his dream of studying physics. He worked in academia for a few years, but for various personal and professional reasons decided to leave it. A few years ago he stumbled upon the wonderful world of data science and machine learning, and felt like he had discovered his second vocation in life. Some of you may know him through Kaggle, where he’s currently ranked in the top 20 for competitions, and in the top 10 for kernels and discussions. He has a wonderful wife and three amazing little boys who keep him constantly busy and amused. He is a voracious reader, passionate about tinkering with all sorts of tools and gadgets, loves digital photography, and really enjoys hiking in the woods.
Vinod Iyengar: Vinod Iyengar comes with over seven years of marketing and data science experience in multiple startups. He brings a strong analytical side and a metrics-driven approach to marketing. When he’s not busy hacking, Vinod loves painting and reading. He’s a huge foodie and will eat anything that doesn’t crawl, swim, or move.