Machine Learning for IT
Patrick Moran: Hello, and thank you for joining our webinar titled Machine Learning for IT. My name is Patrick Moran, and I’m on the marketing team here at H2O.ai. I’d love to start off by introducing our speakers. Vinod Iyengar is VP of Marketing and Technical Alliances at H2O. He leads all product marketing efforts, new product development, and integrations with partners. Vinod comes with over 10 years of marketing experience.
Ronak Chokshi is responsible for industry vertical solutions and product marketing content at H2O. Prior to H2O, he was Product Marketing Lead at MapR Technologies. Ronak comes with 15 plus years of experience with cross-functional roles in startups, midsize companies, and large corporations, shaping solutions, strategy, and leading go-to-market execution across a variety of industry verticals.
Before I hand it over to Vinod and Ronak, I’d like to go over the following housekeeping items. Please feel free to send us your questions throughout the session via the questions tab. We’ll be happy to answer them towards the end of the webinar. This webinar is being recorded. A copy of the webinar recording and slide deck will be available after the presentation is over. I’d like to hand it over to Ronak and Vinod.
Ronak Chokshi: Hello, my name is Ronak. In this one hour, we would like to go through a few things here. I’d like to give you a bit of an overview of what we’re going to talk about in this session. We’ll talk about some challenges and imperatives as enterprises adopt AI. We’ll talk about the H2O platforms, and how we are enabling IT and DevOps in this journey for our customers. Then we will show some demos.
So let’s move on to an overview. We all know that AI is pushing the boundaries for enterprises today, regardless of your industry. Predicting machine failure, preventing diseases, anti-money laundering, fraud prevention, and so on. You also know that organizations have formed dedicated teams, data science and data engineering teams, to solve these problems. Each one of these is a data-driven use case, as we see them. As a result, you need high-quality data to train the machine learning and AI models before they go into production. But deploying these AI-driven use cases requires IT support throughout this process.
Vinod Iyengar: Yes, and just to add to that, we always talk at H2O about the idea of data science and machine learning being a team sport. As Ronak mentioned, typically, a lot of focus is on the data scientists and the data engineers, but IT and DevOps are equally important, if not more, parts of this equation. Because eventually, what actually makes it to production, or what gets used in making an impact, is dependent on how these models or these outputs of machine learning and data science workflows get applied.
That’s one of the critical reasons why we chose this topic. We are focusing on how data science can be aligned with IT professionals and DevOps folks who can start to think about what tools, what best practices, and what considerations they need to look at before implementing these models.
Ronak Chokshi: Right. Well said. This slide shows our understanding of the role that IT and DevOps play in this process. They’re very important to this journey. They are responsible for providing the data, managing the infrastructure, and then that’s the start of the process. The data science team and the data engineering team help with preparation of the data, ingestion, building the models, training them, and validating them. Then once again, it comes back to the DevOps and IT for deploying those models, and then further monitoring them.
This looks like a short process, but we all know that this is a very sophisticated process. We’ll show, during this session, how H2O products are enabling this journey for the IT and DevOps teams.
All right, so these are some of the things that we know that the IT team is up against. As an IT leader, you have to manage the infrastructure, which includes provisioning your infrastructure, administering it, enabling multitenancy, and enabling deployments. There is a lot of data security that goes into managing this infrastructure.
There is very sophisticated governance that needs to be handled, especially in regulated industries and so on. There’s also the integration aspect. Your enterprise often runs hundreds of applications, and all of that has to be integrated into the process as new use cases are implemented and new models are built.
Then, on to containerization. Obviously, we are now at a time where Kubernetes deployments are widespread. Applications are becoming more and more portable. That, once again, lands in IT’s lap, to manage the containers.
Vinod Iyengar: Just to add to that – this is true for most IT workloads, not just analytics and data science. It is especially true here, partly because data science and machine learning are changing at such a rapid clip. The frameworks and the tools that were used, say five years ago, are obsolete now. There are completely new frameworks and new tools.
For example, in the last five years, a different set of algorithms, like more tree-based and deep learning-based frameworks, are being used more widely across the organization. These require a completely different architecture, not just for deployment, but also for training.
So part of what Ronak touched upon in management is the provisioning of instances. For example, now you have to provision, going from commodity hardware and old Hadoop data to more HPC-type of hardware with fast CPUs and GPUs. You may have fast SSDs, so that the data read and ingest rates are much higher.
Then similarly, for deployment as well, consider the use of dedicated inferencing engines like NVIDIA T4 GPUs or even some of the latest Intel CPUs – those are all considerations you need to think about. As these frameworks keep changing or updating, are there still best practices? Are there still underlying guiding principles that you can use, to make sure you are up-to-date? You’re supporting your organization’s data scientists, because data scientists continue to try new algorithms and new frameworks, and they want to be able to deploy them. How do you support all that without losing sleep?
Ronak Chokshi: Right. Now, let’s talk about data access. Again, when the business teams and the data science teams have to implement and put certain use cases into production, you have to pull data from all kinds of sources. The data itself is varied in terms of its size. They could be machine logs, they could be sensor data, transactional logs, et cetera, right? Then in terms of sources, data could be sitting in any one of these cloud providers, or Hadoop on-premises, or applications, and so on. It’s a very complex landscape that you’re dealing with as an IT leader.
So let’s go into the operational simplicity. I spoke about the data science process and the involvement of IT in this process. The table at the bottom of this slide includes the things that you would have to consider. Meaning, when it comes to installing new software, it has to be simple, it has to connect to applications, and it has to account for on-premises as well as cloud infrastructure. It’s the same thing with management.
Upgrades have to be managed. They have to be seamless. There has to be enterprise support available for the upgrades, in case something goes wrong. Rollbacks, monitoring. Monitoring is a big thing, because you don’t want resources idling and your IT bill going up at the end of the month. If there is drift or an anomaly detected, specifically with models, then that is something that the DevOps team is concerned about.
Security, I mentioned that earlier. Lastly, governance. There is data traceability and model traceability, along with policy management that is very critical, like I said earlier, in regulatory-heavy industries.
Vinod Iyengar: Basically, there are three basic objectives for the IT organization. The first is increasing the productivity of your enterprise. Meaning, make sure that your data scientists and your users are able to operate at the best, most efficient level. Which means, they have to have access to the latest versions of the software, and to the best possible hardware. That way, their compute cycles are much shorter.
The second is management of cost. Most IT organizations are run as a cost center, so there is always a big pressure to make sure that they are optimizing resources. How do you do that? Make sure that you are fully utilizing all the resources you have at hand. So, use best practices to avoid idle time, or idle resources, or even unused resources. Make sure that those are reduced.
At the same time, also make sure that you’re making the right choices around what frameworks to use, and what platforms to use, so that you can distribute the workloads if possible. Instead of buying a lot of new hardware, you can use the existing hardware to do more work. Those are all considerations.
Similarly, build vs. buy: often, you can build your own on-prem data center, but hybrid is kind of the new game in town because it doesn’t make sense to keep procuring hardware when you have these spiky workloads. Machine learning workloads can be extremely spiky. You might have a need, like a bunch of GPU farms just to run a big experiment, and then you don’t need them anymore.
So, in those cases, maybe you are better off just running them on cloud for a while and then collecting the results back. Resource management is often a big part of what you do with managing costs.
Finally, the third big, probably equally important objective, but often much more important for regulated industries, is risk management, if not risk avoidance. What does it mean? The last thing a bank wants is a breach. The last thing a healthcare firm wants is a privacy breach. So, you need to make sure that your access control and governance practices are really rock solid. The easiest way to prevent risk is to not do anything. You can shut down everything, and you won’t have any work being done, which is why I say it’s risk management.
You still need to allow your data scientists to be able to make use of the data to build models. At the same time, put best practices in place, and put policies in place. Often, it’s unintentional, right? Unintentional breaches, unintentional data sharing. These are obviously three orthogonal objectives, but they have to come together. That’s the goal for all the IT organizations, and this is especially true when you’re helping support the data science and machine learning workloads.
Ronak Chokshi: That’s an excellent summary. All right, let’s get to our product suite. On the left here are two open source products. You can look them up as well. The first is H2O, what we call H2O Core. It is 100% open source, licensed under the Apache License, Version 2.0. This is built for data science teams.
The second is called Sparkling Water, primarily where H2O runs on Hadoop and Spark deployments. On the right here, is Driverless AI, which is our commercial software. Primarily, it’s an enterprise-ready software, and built for data analysts, data engineers, data scientists, domain scientists, and so on. We’ll show Driverless AI in a few minutes, when we get to the demo. This Driverless AI primarily enables fully automatic machine learning. Everything from feature engineering, model development, hyperparameter tuning, deployment, model management, and monitoring.
Vinod Iyengar: Both H2O open source and Sparkling Water, as Ronak said, are enterprise-grade software. These are built for the largest banks in the world, and the largest healthcare, telcos in the world, and they’re used really widely across all the Fortune 500 companies. What this means for you as an IT professional, or an organization enterprise, is that we have taken care of all the required practices on security and on governance. We make sure that the software can run inside your IT setup, whether it’s on-prem, cloud, or hybrid. We’re going to look at all of those in a little bit more detail, but we have enterprise software to ensure that these work well with your existing ecosystem of products. This is built to fit into whatever else you have set in place.
Ronak Chokshi: That’s the next three slides. Let’s talk about Driverless AI and how it benefits IT. Just like Vinod mentioned, you get the flexibility to deploy in any of the three major cloud providers, as well as on-premises. We’ll show that in a second. You can ingest data from a variety of cloud object stores, on-premise data stores, et cetera. Security is through Kerberos, LDAP, and SSH. All kinds of third party algorithms are included, as well as deployment flexibility, monitoring, management, and a very convenient upgrade mechanism.
As you can see, this is ideal for making your own AI across any of the cloud or on-premise environments. This is Enterprise Puddle. This is unique for instances where you want the flexibility to spin-up instances of Driverless AI in any virtual private cloud environments. Again, you have the same controls and the same flexibility that you get with Driverless AI, along with specific controls – granular controls, user controls, and permissions that IT often needs. I won’t go through all the benefits here. Vinod, you want to add anything?
Vinod Iyengar: Sure. We built Enterprise Puddle basically to let the IT and DevOps professionals provide almost a managed service for their internal consumers, who are the data scientists. Typically, what’ll happen is, if you are setting up AWS access or Azure access for each individual user, they will have to go into the console and then spin up machines. They will have to set up their own IAM credentials, provision the VMs, connect to S3 buckets or configure Azure Blob Storage, wherever the data is, and set those up. This can be pretty time-consuming, not to mention it requires a non-trivial amount of understanding, because if you’re a data scientist, or not used to the cloud best practices, you may end up doing something wrong, or spending a lot of time figuring that out. Often, these things don’t change.
With Puddle, we know exactly what Driverless needs, or even H2O needs. Both for H2O and Driverless AI, we have set up platform templates – AMIs. We packaged all those up, so you can essentially provide an internal managed cloud for your users, who can log into the cloud, and then they will have predefined instance types available that can be spun up in just a couple of clicks. We’ll show in the demo later, at the end of the presentation, how convenient that can be.
The second big benefit is, of course, the management of resources. You can specify how much access each data scientist can have. For example, you want to allow only a pool of 10 machines with so many CPUs and GPUs for a group of data scientists. You can specify that, and then that way, if someone tries to exceed it, the service will prevent them from doing that. That’s extremely useful to control the costs, which can be pretty significant on the cloud.
The third thing we do is actually connect with your existing security connections, whether it’s LDAP or Kerberos, or even Azure Active Directory, or AWS IAM credentials. Whatever your security credentials might be, you can use the same for managing Puddle instances as well. Again, as Ronak highlighted, this exclusively runs in your VPC. This is not a SaaS offering; this is a managed offering that you run in your own VPC. It’s extremely powerful, and we have a couple of really interesting customers now – large financial institutions who are using this to manage tens of users on their AWS and Azure accounts. For each group, they’ve set up small Puddles of sorts. Each Puddle serves 5 to 10 users at a time, but this can very easily handle even hundreds of users, for that matter.
Ronak Chokshi: This is perfect for environments where you want the data modeling and machine learning to happen within the enterprise, without data going out of the enterprise – so within your VPC. Lastly, we have Steam here, which is very similar to Puddle; it has the same controls, as well as the same user controls and permissions. The difference here is that you have the flexibility to spin up instances of H2O and run them as YARN jobs, so management is through YARN here.
Vinod Iyengar: Enterprise Steam, as Ronak mentioned, is a similar product, but mostly for on-prem management of H2O instances and H2O Sparkling Water clusters. Typically, you provision Steam to run on one of the nodes, on the edge node, and make a few nodes available to it. Again, these are provisioned through YARN. The neat thing about Steam is that once you set up Steam with user credentials, just like Puddle, you can also specify the entitlements that each individual user or user group can have.
For example, you can set up a user group called data scientists and you can say, “They have access to 10 nodes, and each node can have up to, say 250 gigs of memory.” You’ve essentially allocated the max cluster size for that group. Once that’s done, every time someone comes into Steam and requests a cluster, they get to specify, “Hey, I need four nodes with 100 gigs each.” Steam will go to YARN and provision a cluster, and then make a proxy link available for the user, so that they can just start accessing H2O Flow or even a Jupyter Notebook. We also give you JupyterHub out of the box, so you can start a Python Notebook and then connect to an H2O cluster and start doing your work. The same thing can be done for Sparkling Water as well, both for the external mode and the internal mode. Whether you’re spinning up Sparkling Water inside Spark or as a standalone cluster, you can still use Steam to do it. All of the benefits that I mentioned for Steam will still apply.
Again, you can connect to LDAP or other authentication mechanisms, and everything will be Kerberized. Because you will be connecting to your HDFS as well, through the Steam cluster, your data access credentials will all be carried over. So when I launch and go to H2O through Steam, and I try to load a data set from HDFS, I’ll only see whatever’s available to me as a user.
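The entitlement check described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how Steam implements it: the `Entitlement` and `ClusterRequest` names are hypothetical, and counting nodes already in use against the group limit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entitlement:
    """Hypothetical per-group limits, e.g. 10 nodes of up to 250 GB each."""
    max_nodes: int
    max_memory_gb_per_node: int


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical cluster request, e.g. 4 nodes with 100 GB each."""
    nodes: int
    memory_gb_per_node: int


def is_allowed(ent: Entitlement, req: ClusterRequest, nodes_in_use: int = 0) -> bool:
    """Return True if the request fits within the group's remaining entitlement."""
    if req.nodes <= 0 or req.memory_gb_per_node <= 0:
        return False
    # Assumption: nodes the group already holds count against the node limit.
    fits_nodes = nodes_in_use + req.nodes <= ent.max_nodes
    fits_memory = req.memory_gb_per_node <= ent.max_memory_gb_per_node
    return fits_nodes and fits_memory


# The numbers from the example in the talk.
data_scientists = Entitlement(max_nodes=10, max_memory_gb_per_node=250)
request = ClusterRequest(nodes=4, memory_gb_per_node=100)
print(is_allowed(data_scientists, request))
```

A request for four nodes of 100 GB each fits inside the group limit, while a request for 300 GB per node, or for four more nodes when eight are already in use, would be rejected under these assumptions.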
It’s very secure and very convenient. For DevOps people, there is a nice admin screen so they can see at any point in time which clusters are idle, for example, or which clusters are active. If there are a bunch of idle clusters, they can kill them or shut them down to give back resources to the other people.
Ronak Chokshi: I also wanted to mention that you can download any of these through our website. If you just go to the download tab on H2O.ai, you’ll find all of these over there as well.
Let’s switch topics a little bit, and show you how this is deployed by our customers, and how you would use this in enterprises. This is an example where Driverless is deployed in the cloud. So from the left to the right, you have data integration and sources essentially sitting in the cloud. It could be a cloud data warehouse like Snowflake, followed by transformations and data quality checks by Alteryx. These are all partners of ours.
The data is then fed into Driverless AI, where we do model development and feature engineering, and then the winning model is deployed back into the cloud. Like we mentioned earlier, this can work for any of the three cloud providers that you see at the bottom of the slide.
Vinod Iyengar: Think of this as a reference architecture. If you’re a 100% cloud customer and you’re using a cloud data warehouse like Snowflake, or it could be BigQuery or Redshift, the neat thing is we have direct native connectors for all those cloud data warehouses and for all the cloud blob storage as well; you can ingest data directly.
We also partner with Alteryx, but in many cases, you might just directly bring the data into Driverless, or you might take it to one of these data prep tools, prepare the data, and then push it back, either via the cloud storage or, in some cases like Alteryx, directly into Driverless AI, because we have a native connector. Once it’s in Driverless AI, the models can be built. We have multiple modes of deployment.
We can have a one-click REST server that can be spun up on the instance itself, or you can deploy them into something like AWS Lambda, or into even something like SageMaker, for that matter. Similarly, we have a whole series of Driverless AI deployment templates, and similar ones for H2O as well. This allows for deploying models in different environments. You can do batch scoring, spin it up as a UDF, for example, and this is all enabled by the fact that both H2O and Driverless AI give you a MOJO, which is basically a seamless format that’s highly efficient and built for deploying anywhere. It is meant to be deployed standalone in whatever environment you’re on, and it doesn’t have a lot of requirements or dependencies.
Ronak Chokshi: You get lots of flexibility when you’re running pure cloud deployments, and multitenancy security, and so on. This is a hybrid deployment with Driverless AI. The difference here is, on the left where you have data sources sitting on-prem or part of it on-prem and part of it on the cloud, you follow the same procedure. Again, the deployments coming all the way to the right, the deployment could be in the cloud or on-prem using a REST server, et cetera.
Vinod Iyengar: This is a good example. Again, we want to highlight one other partner of ours, BlueData, that we work pretty closely with. They have a very nice product that can manage provisioning of instances, resource management, and orchestration. They can do it both on-prem and cloud, and we work with a bunch of customers together with BlueData. The neat thing is both H2O and Driverless AI are available in the BlueData catalog, so they are ready to be provisioned. We make sure that we test and update the latest versions there. In terms of data access, if you end up using BlueData’s DataTap, we have a native DataTap connector in Driverless AI, so you can ingest data directly.
Ronak Chokshi: We have very tight integration with BlueData.
Vinod Iyengar: Yes. Then similarly, we also have very good integration with IBM. So, one of our large customers uses IBM’s Spectrum Conductor. That’s a similar offering for orchestration and management of instances. This is a really large organization with over 100 users of Driverless AI who are all able to provision instances through the IBM Spectrum Conductor. It’s very convenient. Spectrum Conductor manages their on-prem data center, their hybrid data center, and they spin up their own internal cloud for Driverless AI instances. Then as Ronak mentioned, when it comes to deployment, you can basically deploy this anywhere. You can deploy this back into the cloud or on-prem. We have full flexibility as to where you can do this. In the same way, a single instance of Driverless AI can ingest data from different sources. You can actually import data from Snowflake and also from HDFS if you’re running on-prem. You can set up both of those things.
Ronak Chokshi: We provide lots of flexibility in hybrid environments. Finally, on-prem. This is for cases where you haven’t yet moved to the cloud, and you’re completely on-prem. This slide shows you that we can connect into any of the HDFS Hadoop distributions, as well as leverage NVIDIA’s latest GPU hardware, Intel hardware, and IBM.
Vinod Iyengar: This is true for both cloud and hybrid as well. We partnered with both NVIDIA and Intel to make sure that we can optimize and accelerate all of our software on the latest GPUs and CPUs. When you upgrade to the latest hardware, our software is guaranteed to be optimized for that. You will get speed benefits, so that on the same amount of hardware you can support more users and run more experiments. That’s an important part, because it can be a huge cost saver. If the software cannot take advantage of the latest advances in hardware, you would need to keep scaling up or keep scaling out, because the software is not optimized. Because we are optimized, we are able to give significant speed benefits, anywhere from 5X to 30X faster on the same GPUs and CPUs, which others may not be able to do. This means that you can provision fewer instances or fewer GPUs, and still get the same amount of performance.
Ronak Chokshi: Hopefully you now have a good sense of the flexibility in the platform as well as the seamless integration that you get with Driverless AI. Before we jump into the demo, I just wanted to talk about a customer, G5. You can look them up at getg5.com. They’re essentially a real estate marketing company. They’re purely on AWS, purely cloud, and the use case was around lead scoring. They make life easier for real estate professionals, as they call people and try to get more and more business for real estate companies. The idea was to screen through the hand-scored calls. Those were the features. H2O was using the Word2vec model. This slide shows the data to prediction cycle, which is essentially creating the model and creating the call scoring model using Driverless AI on Amazon EC2. Then this slide shows how this was deployed. As the calls come in and the metadata is fed into the call scoring model, which is deployed in AWS, the agents get a real-time score of the customers. Now they have a very good understanding of the intelligent lead scoring of the customers calling in.
Vinod Iyengar: This is a great example of what we can do, as we can fit in with other tools in the ecosystem. In this case, they use both H2O open source and Driverless AI. Along with AWS’s Transcribe functionality, they were able to bring in the call recordings’ WAV files directly into AWS Transcribe, and get the transcripts through Word2vec from H2O. Because H2O and Driverless AI both support the MOJOs, they were able to take the MOJO artifact, use that to score and create new vectors, and give it to Driverless AI to build models. Then they could take the final MOJO and deploy it into AWS Lambda.
The neat thing is that all of this was achieved in a matter of months. They had a really small team that was dedicated to data science and machine learning. They were able to put this entire pipeline up on the cloud in just a matter of months, and they got significant savings. They saved nearly a million and a quarter per month, purely on calls. They were handling up to a million calls per month. This was all done through AutoML and also by using all the tools that we were able to provide. So this is just an example of what you can achieve very quickly. This doesn’t have to be a year-long project. This can be done in a matter of months, and be tested, validated, and deployed into production.
Ronak Chokshi: Awesome. Let’s get into the demo.
Vinod Iyengar: Let me first jump in to show Enterprise Puddle. Puddle is basically the managed service for both H2O and Driverless AI instances on cloud. This is obviously a version of ours that we use internally, but when we work with large enterprises, we can set this up on their own VPC for them. They can have their own Puddle environment for data scientists.
There are a couple of things I want to point out. Obviously, I have a couple of instances. We’ll look at the demo in a bit. Once I go to this instance, I can actually see all the details of that instance as well. I can start a new Driverless or H2O instance. Let’s say I want to create a new Driverless AI version. I can pick the version of the software. All the previous versions are also available. This is really neat because as an IT or DevOps person, you can make sure that all versions are supported in your organization. The data scientist can go back and forth if they need to.
It’s very easy to spin up instances, shut them down, and go back and forth. Once you pick a version, you can give the instance a name. I can name it Vinod’s demo. Then you can pick an instance type, whether it’s a CPU or GPU, and this is all completely configurable. In the administration, when you configure it, you can set what types of instances are available.
Depending on your IT budget and your constraints, you might offer big, large instances and small instances. This is also configurable at a user level. I might say, “Ronak has access to these two GPU instances, whereas the other user may only have access to CPU instances.” You can have different entitlements across the organization: user-defined controls.
Then you can specify the volume size. Again, whatever options are shown here are all configurable by the user. This is a really neat feature especially for cloud, because there is an auto shutoff if the instance is idle for either an hour or 30 minutes, depending on what the user picks. Now this is really important. We’ve seen a lot of horror stories from customers where someone spun up a GPU instance or a large instance, like multiple GPUs or multiple CPUs. They started an experiment and forgot to turn it off all night. They came back the next day or so and they were slapped with a huge bill. We have had customers who have run up hundreds of thousands of dollars, just because a few data scientists forgot to turn off their instances. We built this especially to avoid those kinds of scenarios. The neat thing is, because we have signed in control into the instance, and we know exactly what’s happening, we are able to do this much better than what you can set up with, say, a cron job, because we can actually know when Driverless or H2O completes the experiment. We can automatically shut it down, instead of only looking at the load on the instance. We can provide any annotations.
I’m not going to create an instance, because I already have two running. Similarly, you can do the same for H2O as well. You can pick an instance type, you can give it a name, pick the cluster or the instance size, and then the annotation. On the admin side, let me show you a couple of things. You can control the users, and you can control the systems. Let me click on systems for a second, right? Then you can specify which users are using it, and you can pick it up. More importantly, let me go to AMIs. This is actually where you can control what instance types and what AMIs are enabled. At H2O, we provide a whole bunch of AMIs for customers. Each of these AMIs is published by H2O. Some of our customers have essentially taken our AMI, made a few modifications, or added a few controls that they wanted. They can basically publish their own AMI, which can then be used for Puddle as well.
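The difference between this kind of auto shutoff and a load-based cron job can be sketched as a simple decision rule. This is a minimal illustration, not Puddle’s actual logic; the function name and its parameters are hypothetical, and the timestamps are made-up example values.

```python
from datetime import datetime, timedelta


def should_shut_down(
    experiment_running: bool,
    last_activity: datetime,
    now: datetime,
    idle_limit: timedelta,
) -> bool:
    """Shut down only when nothing is running and the idle limit has passed."""
    if experiment_running:
        # A long overnight experiment keeps the instance alive.
        return False
    return now - last_activity >= idle_limit


# Hypothetical timeline: the experiment finished at 07:30, checked at 09:00.
finished = datetime(2020, 1, 1, 7, 30)
now = datetime(2020, 1, 1, 9, 0)

print(should_shut_down(False, finished, now, timedelta(hours=1)))
print(should_shut_down(True, finished, now, timedelta(hours=1)))
```

The point of the sketch is that the rule keys off whether an experiment is still running, so an instance is neither killed in the middle of a job nor left on for hours after the job completes.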
The first thing that we do when we work with your organization and set up a Puddle, is work out what your approved AMIs are, and then make sure that those are enabled. Within AMI, you can specify whether you want to enable Jupyter Notebooks, for example, or even enable security, or which ones are disabled or enabled. There is a lot of control for the DevOps and IT people. They can set it up.
In addition, we also give you access to a dashboard called Puddle Stats. Puddle Stats gives you a high-level view of all the different instances that are running at any point in time. You can also get a sense of the cost that we are running up. You can say that, “Hey, these H2O Driverless instances are costing about $5000 total in this time period.” That can enable you to control your costs if they are going out of control.
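A cost roll-up like the one Puddle Stats displays can be approximated as hours multiplied by an hourly rate, summed per product. The sketch below assumes hypothetical instance type names, rates, and usage records; none of these values come from Puddle or from any cloud price list.

```python
from collections import defaultdict

# Hypothetical hourly prices per instance type, in US dollars.
HOURLY_RATE_USD = {"cpu-large": 1.50, "gpu-single": 3.00}

# Hypothetical usage records: (product, instance type, hours run).
usage = [
    ("Driverless AI", "gpu-single", 100.0),
    ("Driverless AI", "cpu-large", 40.0),
    ("H2O", "cpu-large", 20.0),
]


def cost_by_product(records, rates):
    """Sum hours * hourly rate for each product."""
    totals = defaultdict(float)
    for product, instance_type, hours in records:
        totals[product] += hours * rates[instance_type]
    return dict(totals)


def over_budget(totals, budget_usd):
    """Return the products whose spend exceeds the given budget."""
    return sorted(p for p, cost in totals.items() if cost > budget_usd)


totals = cost_by_product(usage, HOURLY_RATE_USD)
print(totals)
print(over_budget(totals, budget_usd=200.0))
```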
Okay, so now let’s go ahead to this one instance that I spun up a little while earlier. When I start an instance, after it starts, this is what I see. I see the URL for the instance. We’ll go to that in a second. The username and password are created. This is configurable. Right now, these are created by default, but you can have your Azure Active Directory credentials, AWS IAM credentials, or LDAP. Depending on whatever authentication mechanism you use, you can have those authentication credentials over here.
This is neat. This is a config.toml file. This allows me to go in and specify other things. For example, if I want to enable the Snowflake connector, I would do it over here, and I would provide whatever credentials I need to do so. So you can update the config.toml file very easily. You also have an SSH key, actually. If you want to actually SSH into this machine, I can do that with this command using the SSH key. If I want to do something in that instance, I’m allowed to do that. You get full control as well.
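As a rough sketch of what such an edit involves, the Python helper below adds a connector name to a comma-separated list in the config text. The key name `enabled_file_systems` and the short name `snow` for Snowflake are assumptions here and should be checked against the Driverless AI documentation for your version; the credential settings for the connector are deliberately left out.

```python
def enable_connector(config_text, connector, key="enabled_file_systems"):
    """Return config text with `connector` present in the comma-separated
    list stored under `key`. The key name is an assumption, not confirmed
    by the webinar."""
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            current = value.strip().strip('"')
            names = [n.strip() for n in current.split(",") if n.strip()]
            if connector not in names:
                names.append(connector)
            joined = ", ".join(names)
            lines[i] = f'{key} = "{joined}"'
            return "\n".join(lines)
    # Key not present (or only commented out): append a new setting.
    lines.append(f'{key} = "{connector}"')
    return "\n".join(lines)
```

Calling `enable_connector(text, "snow")` twice leaves the text unchanged the second time, so the helper is safe to run repeatedly from a provisioning script.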
Then you can also get a quick sense of all the different models that have been run. This is a very high-level view. Even before you get to the actual instance, you can actually see what’s been happening in this instance, and I can see the status of the job. I might have started a few jobs overnight, and then I can come and see what progress they have had. This is very useful, and you can get a quick read on the system metrics as well, such as if CPUs are being highly utilized or not.
Let me jump into the instance itself. This is Driverless AI. Once I took the URL over here and populated a new window, this is what I get. I’m in EC2, obviously. I can pull data from different sources. Obviously, I’ve only enabled a few, but as you can see, I can pull data from S3 or from HDFS as well. I can pull data from on-prem, or the cloud, or upload the file directly as well. There are multiple ways to bring data in. Once the data is in Driverless AI, you can obviously run through the whole modeling workflow. I’m not going to spend too much time doing the modeling, because you can see that in other webinars, and we have covered this pretty extensively in the past. Let me just very quickly do a couple of things.
Let’s say you built a project for doing the machine learning workflow. Let me pick this one. You have a data set. You have multiple data sets for training and test. You build a whole bunch of experiments, and building an experiment is very easy. You just click predict, it loads up the experiments, and you can use the wizard if you want, if you are new to the product. If not, pick the target column, which is weekly sales in this case. You can put in a test set. Then, I will specify what the time column is. Then after that, I have the opportunity to specify the grouping. I’ll let the software pick automatically. I can specify the horizon for forecast and the gap as well. Then, the three knobs for accuracy, time, and interpretability are all I need to change. Let me pick those and launch the experiment.
It will go through the entire workflow, first by checking for usage, any differences in the data, distribution shifts, any null or missing values, and then apply data fixes. As you can see, it’s detected a bunch of distribution differences between the training and test sets, and it is giving us warnings about those. You can turn on the warnings, or you can even force high-level permissions for Driverless AI. In the expert settings, you can tell it to drop these columns if you want, if it’s very different, for example. All of these are configurable over here.
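To make the idea of a train/test distribution difference concrete, here is a minimal Python sketch using the population stability index (PSI). This is a common general-purpose drift measure swapped in for illustration; it is not presented as the check Driverless AI itself runs, and the bin count and the `eps` floor are arbitrary choices.

```python
import math


def psi(train, test, bins=10, eps=1e-6):
    """Population stability index between two non-empty numeric samples.
    Bins are equal-width over the training range; test values outside
    that range fall into the first or last bin."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each fraction at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = fractions(train), fractions(test)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A value near zero means the two samples look alike; a frequently quoted rule of thumb treats values above roughly 0.25 as a substantial shift, which is the kind of column one might consider dropping.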
We also provide leakage detection. For example, if you have a highly leaky column, it will give you a warning saying that, “Hey, this column is leaky. Consider dropping it,” for example. Then you can obviously control the different algorithms that are available. What do you want to include and what do you want to exclude? Similarly, for time series, you can include things like what kind of lags are allowed and what lags are not allowed.
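For readers unfamiliar with lag features, the sketch below shows what restricting the allowed lags means in practice. It is a plain Python illustration with a hypothetical function name, not the feature engineering Driverless AI performs.

```python
def lag_features(series, allowed_lags):
    """For each time step t, look up the value `lag` steps earlier for
    every allowed lag. Positions with no history yet are set to None."""
    rows = []
    for t in range(len(series)):
        row = {}
        for lag in allowed_lags:
            src = t - lag
            row[f"lag_{lag}"] = series[src] if src >= 0 else None
        rows.append(row)
    return rows
```

With weekly data, allowing only lags 1 and 52 would give the model last week's value and the value from the same week a year earlier, and nothing else.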
NLP itself has a bunch of different algorithms like BiGRU TensorFlow models, and then character-based, CNN-based TensorFlow models. In addition, you can also bring in your own pre-trained embeddings for your own corpus of text, and employ it over here if you want to.
On the system side, you have a lot of fine-grained control on how many cores to use per experiment. You can choose how many GPUs to use, if they are available. Then, if you want, you can enable detailed traces. This is extremely useful: you can see a lot of detail about what exactly is happening at every step. Similarly, recipes can also be controlled. This is a really neat feature of Driverless AI that was introduced in 1.7.
Then, the model management. Sorry, the remaining parts of it: we can control all the config TOML settings and whatever strings. You can add in some additional config if you want to. While this is running, we can see what features are getting generated, as they’re being built. A lot of the target lag features have a lot of significance, and you can actually look at what those mean in some of our other webinars.
I want to point out a couple of things from an IT perspective. So, everything we do is being logged over here, and full system-level logging information is available. This is exactly what we have access to. You can go back and always look at what happened. Similarly, we only have simple trace enabled, but you can enable detailed trace as well. When we enable that, you can actually go to every single cell and see what is exactly happening.
For example, if you think that the experiment is running for too long or it is stalled, you can actually see a lot of detail as to what exactly happened. One of the neat things with Driverless AI is the ability to finish or stop an experiment. Let’s say I finish – it will finish the experiment, but more importantly, I will have the checkpoint available. This is neat. It’s a nice way for debugging your models and then restarting it after making your fixes.
Let me go now to an experiment that we finished earlier on the same data set, and quickly show you what happens after it’s done. So, once the model is built, you will see the experiment report being published. This experiment report is a really nice and easy way to actually see what’s happening in your data set. This is actually very useful, too. As an end user, you can basically see an experiment summary being created, like this, which has all the information that happened under the hood. You can open this report doc that gets generated, which is basically like a 20 to 25-page report that tells you all the information around what happened in that experiment, what data set was used, the settings of Driverless AI, the version of the software, the system specifications, all of the training data, and test data paths. Some of this just takes on the input data, and all the shifts that we talked about.
This is very useful for governance purposes. If you can capture this entire report, keep it along with your experiment, then you can always go back and see what happened. It’s really useful for lineage and governance of what exactly happened. Along with that, you can see other information on what treatments were done on the data. You can see what features were generated, and how many models were built. It’s a really nice report. We have a complete webinar on this if you want to understand more about what this report can do. One thing I would point out is that we can completely customize this report based on your requirements. So as an IT organization, you can specify what needs to be maintained for governance purposes, and we can customize this for that.
Jumping back to my instance, let me go to another experiment which finished a little while earlier. I can obviously interpret the results, and we have provided a lot of functionality around explainability and interpretability that’s extremely powerful. We have a lot of good info on that if you’re interested in that topic. The neat thing I want to point out is the ability to download a Python or MOJO scoring pipeline. This basically provides you with the entire pipeline, including the feature engineering and the models that got created, which is ready to be deployed in different environments. We provide the ability to deploy it in Java. There’s a Java runtime, there is a MOJO to Python runtime, and a MOJO to R runtime as well, in addition to an R and Python client.
Data scientists obviously come in different flavors. Some of them swear by R. Some of them are Python stars, and then there are many others who basically want to work directly in Java or somewhere, right? There are some others who like the GUI. In both H2O and Driverless AI, you have the ability to provide your data science users the flexibility of which environments they want to work in. All the models they produce and all of the artifacts they create will still be available in the same format, irrespective of whether they use the R client or Python client. Similarly, the MOJO runtimes are just wrappers for that whole MOJO, which itself is the same. You can provision the conversion of the MOJO files, and then you can provide these runtimes on demand, when they are doing scoring.
Ronak Chokshi: I have just one more addition to that. Think of your enterprise running tens or hundreds of applications written in all kinds of programming languages. Regardless of what they are and which language, you can get help from this part of the demo, this part of the Driverless AI, and get MOJOs in Java, CEC clusters, and essentially, integrate those models into the production applications.
Vinod Iyengar: Perfect. One final thing I want to show is deployment. When I click on deploy and close these applications, I have the option to do AWS Lambda. This is a really neat way to launch on cloud. I can use my existing AWS environment variables with my access key, and then pick the region and deploy, and then deploy directly into AWS Lambda. It’s a very quick way to launch an endpoint. I can also do a local REST endpoint. I can just specify the port number, specify the maximum heap size, hit deploy, and this will basically deploy the model that we just built as a local REST endpoint. This is a very quick, easy way to do one-click deployment of models, prototype them, and test them very quickly.
Obviously, this may not be the most optimal way for production deployments, because where your training is running is not where you might run your production. You can take these deployment templates, the REST, Lambda, and other deployment templates that we published. Let me show you that in a second over here. With this, I’ll get a repo called DAI deployment templates. It’s open for everyone to see. Basically, as you can see, there are a whole bunch of different deployment templates: AWS Lambda, KDB, and local REST server. These are very easy to use. We have Swagger templates available. You can take these to publish your models. Instead of running this locally, you can run it on a different instance. This is maintained by us.
I also want to point out our documentation – go to docs.h2o.ai. Let me go back and show that. This is where you come to documentation, pick up the user documentation and everything we covered today. There is a full setup on configuration, for how to do an install and configure H2O on Driverless AI in different environments, whether it’s on-prem, cloud, or on IBM Power. You have documentation on how to set up environment variables.
This is something that’s really useful if you are spinning up for a lot of users using containers. You can search for all of these things. Similarly, the data connectors. When I set it up, either through Puddle, Steam, or even if you’re doing it individually, you can specify all the different data connectors; enabling them is super easy. Once you enable them, all users in your organization will have access to that. You don’t have to teach the data scientist how to do this individually. Similarly, deployment of these MOJOs is very easy as well. As you can see, there’s a lot of explanation around how to deploy them on Lambda for REST endpoints.
That’s pretty much all of the things we wanted to cover. The model got deployed when you clicked it. As you can see, this is the endpoint, and this is where the model is running. Obviously, this is local host, so I can actually take this whole command and try it on an endpoint. I should be able to get a result. It’s very easy to test if your endpoint is working or not.
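The same kind of smoke test can be scripted. The Python sketch below builds a JSON POST request with the standard library and, separately, sends it. The `/model/score` path, the `fields`/`rows` payload shape, and the column names in the usage note are assumptions for illustration; use whatever the deployment dialog or the template's documentation actually shows for your endpoint.

```python
import json
import urllib.request


def build_score_request(base_url, fields, rows, path="/model/score"):
    """Build (but do not send) a JSON POST request for a scoring endpoint.
    The path and payload layout are assumed, not confirmed by the webinar."""
    payload = {"fields": fields, "rows": rows}
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `score(build_score_request("http://localhost:8080", ["Store", "Dept"], [["1", "2"]]))` would post one row to a scorer listening on port 8080, where the port and the two column names are placeholders.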
Ronak Chokshi: Great. Hopefully, this webinar gave you a really good view of how you can use Driverless all the way from downloading it, using it, creating instances using Puddle, creating machine learning models, and deploying them. We consider this a very easy-to-use software, and that’s how it’s been designed. That’s all we have.
Vinod Iyengar: Okay. At this point, we’ll open it up for some questions, if there are any. Please feel free to post your questions in the channel.
Patrick Moran: While we’re waiting for questions to come in, I just want to take this time to remind you all that presentation slides and the recording will be made available on our BrightTALK channel. Be sure to just visit the same link that you did to access this webinar to get those materials after the presentation is over.
Vinod Iyengar: Okay. One question is, how do you enable good provisioning, such as practices around provisioning on different cloud environments, and managing those versions? Obviously, we have different clouds such as AWS, Azure, GCP, Oracle, and multiple ones. Go to the h2o.ai/downloads page. You’ll see a whole bunch of different AMIs and templates being published by us. We maintain them for every version.
We do have two streams for Driverless AI, especially. One is the long-term stable stream, LTS releases, and then there is our latest stable. For a lot of our large organizations the recommendation is often to stay on the LTS stream; that is tested, validated, it’s robust, and we continue to support any bug fixes or hot patches for at least 12 to 18 months. That’s often a very stable stream for large enterprises to be on. If you want to set up an enterprise-wide cluster or environment, that’s a good place to be.
For a lot of customers who are interested in the latest features and who want to try out the newest products very quickly, the latest stable is often a good place to start. Even that has to be maintained, but we continue to add more features at a rapid pace there. I recommend using those templates directly from our page. If you go to h2o.ai/download, you’ll see the options to download and play with every single offering that we talked about today.
Patrick Moran: Okay, great. I’m not seeing any more questions come up, so I just want to say thank you to Vinod and Ronak for taking the time to talk today, and giving a great presentation. Thank you all for joining us as well. Have a great rest of your day.
Vinod Iyengar: Vinod Iyengar comes with over seven years of marketing and data science experience in multiple startups. He brings a strong analytical side and a metrics-driven approach to marketing. When he’s not busy hacking, Vinod loves painting and reading. He’s a huge foodie and will eat anything that doesn’t crawl, swim, or move.