Read the Full Transcript
Saurabh Kumar: Hello and welcome everyone. Thank you for joining us today. My name is Saurabh Kumar, and I’m on the marketing team here at H2O.ai. I’d love to start off by introducing our speakers. Vinod Iyengar is the VP of Marketing and Alliances at H2O.ai. He comes with over a decade of marketing and data science experience in multiple startups, and has built a number of models to score leads, reduce churn, increase conversion, and many more use cases.
Our second speaker, Pratap Ramamurthy, has recently joined H2O. Before H2O, he was a Partner Solution Architect at AWS, where he helped create a machine learning partner ecosystem. At H2O.ai, Pratap helps customers architect the solutions on clouds, as well as solve data science problems. Before I hand it over to Vinod and Pratap, I’d like to share a few webinar logistics. Number one, please feel free to send us your questions throughout the session via the questions tab in your console. We’ll be happy to answer them towards the end of the webinar. Number two, this webinar is being recorded. A copy of the webinar and slides will be available after the presentation is over. Without further delay, I’d now like to hand it over to our speakers.
Vinod Iyengar: Thank you, Saurabh, for the warm introduction. Hi, this is Vinod Iyengar, here from H2O. As Saurabh mentioned, I work on the marketing team and also on some of our data alliances. I’m very excited to talk today about this topic, which is extremely timely in terms of the investments that are being made by enterprises. Before we get to that, a quick introduction about who we are.
H2O.ai is a growing, worldwide open source community. We have been in business for about seven years now. In the process, we have created this massive ecosystem of data scientists, developers, and enterprises, using machine learning in various [inaudible 00:02:38]. Close to half of the Fortune 500 companies, including a lot of the top 10 banks, insurance, and healthcare companies, are currently using H2O open-source software. We also have an ecosystem of nearly 18,000 companies globally that use H2O, and close to 200,000 data scientists. Those numbers keep growing all the time. In addition, we have a very strong global data community, and we also put on user conferences, which we do periodically. The last ones in New York, London, and San Francisco were massively attended, both online and offline.
In terms of the products, we have two categories of products. On the left, you see our open source ecosystem, which is basically comprised of H2O Core, as we call it internally. Also, it’s popularly known as H2O 3. It’s a massively scalable, distributed machine learning platform, which is very widely used, globally.
You have Sparkling Water, which is essentially H2O Core running on top of Apache Spark. Then you have H2O4GPU, where we can accelerate some of these machine learning algorithms by taking advantage of GPUs. All three of them are completely open source. Meaning, they are licensed under Apache V2, so you can take them and use them as you deem fit, and build products on top of them, as many companies have done.
They are built for data scientists, meaning that you can use them through R, or Python, or through our H2O Flow interface, which is a very interactive data science notebook. We do offer enterprise support subscriptions for customers who wish to build their entire AI infrastructure using our open-source platform, and we help them in their journey.
On the right, you see our commercial offering, Driverless AI. Driverless AI is an automatic machine learning platform, where we take the entire machine learning workflow, the data science workflow if you will. Using what we’ve learned from open source, we are able to automate everything, from feature engineering, machine learning, interpretability, and deployment.
Driverless AI is built for a wider audience. We are able to democratize AI and machine learning by taking it to junior data scientists, domain analysts, even folks who are experts in specific analysis, and maybe even data engineers and developers. We automate their full, end-to-end pipeline, which means that you can go from data to production with very little effort. That’s a commercial offering that we sell licenses for.
Here’s a quick overview of our H2O open-source platform. So really, if you’ve not used this before, these are the reasons why you should consider using it. It is 100% open source. It is built to be completely compatible with the big data ecosystem. Meaning that we work with Hadoop, Spark, and all the versions that are popularly used in the enterprise. We also support all the popular data science interfaces. That means that you can use it through R, Python, Scala, Java, or H2O Flow, which is our interactive workflow.
At the core, what we’re offering are smart and fast algorithms that are completely scalable and highly performant. Meaning, you get the best-in-breed algorithms in a highly distributed fashion, so that you can throw as much data as you want at it. As long as you have enough compute and memory, the software will scale out, and use all the memory and compute available at its disposal to solve the problem.
In the process, you can generate some highly accurate models which are extremely easy to tune, but you also then get the artifacts, which are basically the POJO and the MOJO, that are optimized for low-latency scoring. So you can go from running on your laptop and controlling a massive cluster with a small R script, all the way to production with a model that’s highly optimized. That’s the end-to-end pipeline which is being generated very seamlessly using H2O Core.
In addition, we enable GPU acceleration with some of these algorithms, and you’ll see them in the H2O4GPU package. Let’s get to the topic at hand today. The topic is “Considerations for Going on the Cloud.” Why should we care about this? If you look at some of the public research that has been put out, there are no surprises. Gartner, for example, states that cloud computing and services will be a $300 billion business by 2021. That’s billion with a B.
I personally believe that they are still understating the value, if you consider all the ancillary services, and all the support, maintenance, and consulting hours that go into this. That number is probably off by an order of magnitude, if you will. So there are massive amounts of investments that are happening on the cloud, and every company has a cloud strategy, if you will. With that being said, what are the things that we need to think about? We’ll take a very machine learning and AI-centric view today. We can explain it a little bit more like a data-centric view, if you will.
Obviously, there’s enough literature out there that talks about cloud computing and cloud migration from a general-purpose view. We’ll focus on the data part of it, which is where our expertise lies. With that, I’m going to bring in Pratap to take over these points. He’s our in-house expert on cloud and database, with many years of expertise in database on the floor. He’s going to talk a little bit about the cloud journey that companies are on. Over to you, Pratap.
Pratap Ramamurthy: Thanks, Vinod. That’s a very good transition to the next slide, that is the cloud journey. So let me introduce myself one more time. I recently joined H2O.ai. We are considered to be the machine learning leaders in the market. Previously, I worked at AWS. I helped build the machine learning partner ecosystem, and H2O was the first partner that I brought into the ecosystem. So, I’m very proud to be here today.
Let’s talk about the cloud journey. When you are looking at a data science team that is trying to solve a certain problem, you have to look at exactly where your infrastructure is. This is not a black or white situation, and mostly it is a spectrum, right? I have seen customers who take a very cloud-native approach. That is, they start with the cloud, all the data is in the cloud, and all the software is in the cloud. They don’t have any infrastructure on-prem. It’s completely in the cloud. Going even further, to avoid any dependency on any single cloud provider, they actually use a multi-cloud strategy, where they distribute the resources across multiple clouds. This has several advantages as well, right? That is what I call a top-tier, very cloud-savvy customer, where they have an alignment right from the top.
The second customer, or the next level, I would say, is a customer who has migrated to a single cloud. They have migrated all their data. They have migrated their software applications into the cloud, right? I’ve seen many customers who are in this situation, but most of the customers are the third category, where there is a hybrid cloud approach, where they traditionally have had data centers. They are in the process of migrating. They usually have a plan to migrate. They would usually hire a consulting firm to migrate their data and migrate their applications. We have several customers who usually use at least a one- to two-year plan. Sometimes they might make that a status quo as well. That is, they would decide to have some part of their data and software in the cloud, and leave some part of it on-prem. That is also a strategy that some customers like to use.
The fourth category is what I call fully on-prem. I think it’s kind of obvious here. There has traditionally been data and software in their own data centers, and it stays that way. So these are all the kinds of customers we’ve seen, and we support all of them. In whatever journey you are in at this point, this webinar is relevant to you. We have value to offer at this point.
Because we’re talking about the cloud, and specifically machine learning workloads in the cloud, let’s first look at the fundamentals of why somebody would even go to a cloud. There are four important value propositions of a cloud. The first is the Capex versus Opex conversation, right? A trade-off.
Traditionally, if you have your own data center, you’re going to be purchasing all the hardware yourself, and you’re going to be paying for all the hardware upfront. That is Capex, capital expenditure, versus Opex, operational expenditure, where you pay as you use, right? There is a very simple analogy here: it’s the difference between purchasing a car by paying upfront versus hiring an Uber or Lyft. You pay for just your ride, right? That is a big difference here, where you could be renting out a server for just one hour, and paying 50 cents for that, compared to a data center, where you would have to purchase a $10,000 server upfront, before you can even start using it.
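To put numbers on that comparison, here is a minimal sketch that uses only the two figures quoted above (a $10,000 server versus a rental at 50 cents per hour). The function names are illustrative, and power, staffing, and depreciation are deliberately ignored, so treat it as back-of-the-envelope arithmetic rather than a pricing model.

```python
# Figures quoted in the talk: a $10,000 server bought upfront versus a
# rented server at $0.50 per hour. Everything else is ignored on purpose.
SERVER_PURCHASE_USD = 10_000.0
RENTAL_USD_PER_HOUR = 0.50
HOURS_PER_YEAR = 24 * 365


def rental_cost(hours: float) -> float:
    """Pay-as-you-go (Opex) cost for a given number of rented hours."""
    return hours * RENTAL_USD_PER_HOUR


def break_even_hours() -> float:
    """Rented hours after which the upfront purchase would have been cheaper."""
    return SERVER_PURCHASE_USD / RENTAL_USD_PER_HOUR


if __name__ == "__main__":
    hours = break_even_hours()
    years = hours / HOURS_PER_YEAR
    print(f"Break-even after {hours:,.0f} rented hours ({years:.1f} years non-stop)")
    print(f"A 40-hour experiment costs ${rental_cost(40):.2f}")
```

Under these assumptions the rental only catches up with the purchase price after more than two years of continuous use, which is why short experiments favor renting and constant multi-year loads look closer to par.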
That’s the big difference in Capex and Opex, and this has enabled quite a few businesses to be able to change the way they operate. The second important use case or value proposition is scalability. The public clouds usually mean that they have data centers across the world. Usually, they have many, many data centers in each of these regions. Let’s say you’re building an application and you want to scale out your application across the globe. To be able to do that by yourself, by hiring people, building out this infrastructure is probably a multi-year plan.
Whereas if you are using the cloud, it’s just a few minutes. You can scale out your application in a few minutes, across the world. That is the kind of scalability that it provides. The third important angle is flexibility, and being agile. I have done this several times in my career now. Let’s say you want to do a pilot program. You want to experiment with something, to show it to your manager or management.
You start off with one instance. You do certain things, and it’s not fully baked, but it is a proof of concept. You quickly build this and show it to your management. Then once you get your approval, you can then expand or scale out your PoC into a full-fledged service, or you can shut it down. The fear of running an expensive experiment is not there anymore, because your experiments are not going to cost you much. You can run an experiment for under $10. Run a PoC for $100, and you can shut it down if it doesn’t pan out, but you can scale out and go full-scale if it really works. That is the flexibility and agility that really emboldens people to be more explorative, and experiment with newer technologies and newer ideas that help you to be more successful.
The last thing is to be able to try out latest technology. Be it GPUs, be it FPGAs, a new kind of hardware, or a new kind of algorithm, it’s easily accessible through the cloud. All you need to do is sign up with your email address, and submit your credit card information, and you would be able to use the latest technology within a few minutes. That is something that is also very attractive for customers.
These are the four main attractive pieces, and it all seems to be aligning very well for the new generation of enterprises. That is the reason why many companies choose to go the cloud route.
I want to contrast that with the value proposition of data, or software, on-prem. There are certain advantages for this as well, but really, that should be contrasted with the value proposition of the cloud. The first thing is that the infrastructure is already paid for. Let’s say you have a data center, and you’ve already purchased hardware, and you want to run a certain workload that is going to take 20 servers. Now you don’t have to worry about how much you are going to be charged per day, or per month. You won’t have to think, “Is my CFO going to come and complain about this new expenditure?” Because the hardware is already paid for, you don’t have to worry about what the billing is going to be. It’s a little weird to think this way, compared to the previous method, but it’s more of a billing construct, and at which point of the billing cycle you are currently in. If you’ve already paid for it, then there is no worrying about the cost; you cannot overshoot your cost, because your cost of using this hardware that you’ve already paid for is zero.
The second point is that if you’re in the cloud, if you’re running serious workloads, you need to have some kind of a cost management solution there. That is, there are now specialized partners who work on reducing your costs in the cloud. This actually kicks in very early on. We don’t want to leave it after you have burned a million dollars in the cloud. You probably want to start very early on. This is not a concern if you are on-prem, because again, you’re not going to be using resources and then finding a shock in your bill, because there is no billing to be handled here. There is a hard limit on how many resources you can use, because also, this is your hardware and you have paid for it.
There may be some conversations you might have to have with other departments in your enterprise. Again, the problem is very different here. It’s not like you have to worry about what might happen if you just let your instance run overnight.
The third interesting thing is GDPR, HIPAA, or PCI compliance. It is tricky to say that on-prem is all GDPR-compliant. You can be on-prem and still not be GDPR-compliant, and you can be in the cloud and still be GDPR-compliant. It goes either way. Let’s say you are traditionally compliant. You are compliant already, and you’re on-prem. If you are trying to move to the cloud, then you have to revisit these guidelines. You have to re-analyze, or run an audit, to see whether your new service that is going to be running on the cloud is going to be compliant with your existing certifications. That is very important.
Security is very important, especially data security. So this is something that needs to be considered really well before you move any customer data or critical data to the cloud.
The last point there is constant loads. Let’s say you have an application that takes one instance, and it needs 10 gig of RAM and two CPUs, and it’s going to be running continuously for the next two years. The best use case there is to use a resource on-prem, because you know that this is a constant use case. The cost there is the best if it runs on-prem. There are conversations about whether this is also feasible on the cloud. It is true. You can have a similar cost running on the cloud as well, but I would say it’s on par. That is something that you need to consider along with the other concerns.
So, these are the cloud versus on-prem considerations. As you might have noticed, I tend towards cloud use cases. The needle is tilting a little towards the cloud, and I strongly believe in that. Those are the differences, in general, between what makes sense on the cloud and what makes sense on-prem. Let’s focus a little bit more on the machine learning workload, or data science workload on the cloud.
The first point there is, for you to be able to do machine learning, you need the data, right? Data is what you can run your machine learning algorithms on, which is what we call data gravity. You can run your machine learning algorithms only in the location where you have the data. That means its physical location, right? If your data is on-prem, or let’s say, if your data is on S3, on a specific region, on AWS, that is the region in which you can run your machine learning workload and train your models.
You cannot run these machine learning models somewhere else because you need to access the data. There might be even heavy access. So proximity to data is probably very important. What that means is, before you can say that you can start to run your machine learning workloads on the cloud, you need to make sure your data is up, and ready to be used on the cloud, in the location that you are interested in.
This also means that you might have to work with your data lake operators and administrators and see if you can move your data to that location, to the cloud. That could also mean moving the data, or actually collecting the data in the cloud itself, or having a secure connection from your on-prem data storage to the cloud, to wherever you’re going to be running your models. That is probably the most important aspect of running your machine learning workload on the cloud.
Think about where your data is. You should also, at this point, consider frameworks. There are lots of newer frameworks that appear on the market, almost on a weekly basis now. H2O has released H2O Driverless AI, which has lots of new features. There are new frameworks that are coming on to the market every week. You might want to try that out.
There are new ML models, new algorithms, and new upgrades. Being in the cloud lets you try out new technologies and new frameworks quite rapidly, versus trying to do this on-prem, where you are stuck with asking somebody else to provision resources for you before you can start trying things out. That also could affect your time to market, if you are building something for your team; it could delay the process of creating new products.
The third important thing is newer hardware. You also need to worry about what new hardware that you may need for your machine learning workloads. This could mean newer types of CPUs, newer GPUs, or even TPUs. You can also run quick trials and PoCs on the cloud. You can evaluate whether these technologies make sense for your use case.
There might be a much newer GPU that might make your workloads much, much faster, but at only 2X the cost. Overall, it might actually be saving you money and saving you time. You can make this evaluation on the cloud very rapidly. That is also one of the things that makes it very attractive for machine learning workloads especially, to be in the cloud.
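The arithmetic behind that claim is simple: a job's cost is hourly price times hours, so a GPU that costs twice as much per hour still saves money whenever it is more than twice as fast. The sketch below illustrates this with hypothetical prices and a hypothetical 4X speedup; none of these numbers come from the talk or from any cloud price list.

```python
def job_cost(usd_per_hour: float, hours: float) -> float:
    """Total cost of one training job on a pay-per-hour instance."""
    return usd_per_hour * hours


def relative_cost(price_ratio: float, speedup: float) -> float:
    """Cost on the newer GPU as a fraction of the cost on the older one.

    price_ratio: new hourly price / old hourly price (2.0 means "2X the cost").
    speedup:     old runtime / new runtime.
    """
    return price_ratio / speedup


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    old_price, old_hours = 1.00, 10.0
    price_ratio, speedup = 2.0, 4.0

    old = job_cost(old_price, old_hours)
    new = job_cost(old_price * price_ratio, old_hours / speedup)
    print(f"old GPU: ${old:.2f} over {old_hours:.1f} h")
    print(f"new GPU: ${new:.2f} over {old_hours / speedup:.1f} h")
    print(f"new/old cost ratio: {relative_cost(price_ratio, speedup):.2f}")
```

A ratio below 1.0 means the newer hardware is both faster and cheaper for that job; a short trial run on the cloud is enough to measure the real speedup for your own workload.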
The most important thing is that machine learning workloads, especially during the training phase, are a classic example of a bursty workload, right? It’s impossible to accurately predict how many resources you would need. Also, when you’re running your machine learning or training, you might be using eight GPUs for two hours, and then you could be shutting it down. This is a very bursty workload, and it does not make a lot of sense to purchase new hardware and have it in your on-prem data center, just for this bursty workload, because your requirement could change very rapidly. It’s best if it’s in the cloud. It’s what I would call a “perfect marriage.”
Let’s look at how Driverless AI works in the cloud. There are four steps here. I’m going to show what the components are of running a machine learning workload, and which of these pieces of the machine learning workload can be in the cloud versus on-prem.
There are four stages here. On the left side are several clouds that have the data. One example of this is Snowflake, which is a partner of ours. We have Snowflake connectors that collect the data. Then you probably also need to work on data quality. You also will want to work on the transformation. You could use Alteryx, also another partner of ours. This can be done on the cloud.
Then the third thing that is important in this pipeline is what we call the Driverless AI. It’s marked in the gray box here, which includes feature engineering, and then building the model itself and training the model. This can be done on the cloud, on GPU instances, or whatever. Once that is done, you have the model that’s built, and then you can deploy the model in the cloud, either on an instance, or you can have something like a Lambda function or a cloud function. All of these pieces can be done completely in the cloud. You do not have to be on-prem to do any of these.
Driverless AI is available on the marketplace, in Microsoft Cloud, Google Cloud, and AWS. This is one scenario where it could be done. The next example is a hybrid use case. Let’s say you want to use both on-prem and the cloud; how could you use it? This is an example, where on the left side, you see data that is coming in from the cloud. This could be from Redshift on AWS, it could be BigQuery on Google, it could be Snowflake, or it could be an on-prem Oracle database as well. You need to collect this.
Some of this could be on-prem, like an Oracle database, but you need to be able to collect this and create a data set. This data set can either be in the cloud or it could be on-prem. This is a point where you need to worry about how you are going to transfer the data from the cloud to on-prem, or the other way around. This can be done. You can use Alteryx, or any data manipulation or data transformation tool.
The next phase is the model building phase. I would highly recommend that you use the cloud for building your model. This is because it would be beneficial if you use multiple GPUs. This could save you time, but there is a bigger advantage: you can make a GPU-versus-time trade-off.
Let’s say you urgently need to build a machine learning model in the next two hours. You can decide to throw in four GPUs and get the job done, versus another day, when you’re not under time pressure. You could let it run on a lesser GPU, or just a CPU instance, overnight, and see what happens the next day. You get to choose this flexibility. You have this flexibility in the cloud. So that is why I would prefer that the model-building capability be in the cloud. If that is the case, then you would also have to make sure that your data is also being collected and transformed, or that the final data is in the cloud.
Let’s say you have built your model. Your model is just a MOJO or a POJO file, in Driverless AI or H2O. Now, this can be deployed anywhere. This is just a very small file, and this file can be served, or you can do the scoring or inference on a cloud instance. Or you can bring it on-prem, run it on an instance, or run it on hardware that you have in your data center. This is a good decoupling point here, right? The model can be run anywhere, versus in the lab. The first three stages of the pipeline are a little tightly coupled, because there are a lot of data transfer charges. There are security issues, whereas this is a single file that is getting transferred. That is something to remember.
The last case is if you have decided to move completely on-prem: you can have your data centers and your data on-prem. That doesn’t mean that you’re traditional, and you’re not using the latest technology. You could be using a partner of ours, like MinIO, which gives you S3-like connections – data storage on-prem.
You could be using Cloudera, or you could be using Hortonworks on-prem; you could have a data lake on that. Also, you can do data transformation, but you could also run your Driverless AI on your internal servers with NVIDIA GPUs or even IBM hardware. This is also supported, and we also have worked with several partners that are exclusively on-prem.
That is also possible. It depends on what exactly you want to do, what your goals are, and what your long-term plan is for moving to the cloud, or staying on-prem for various reasons. I’ll now hand it back to Vinod for the next slide.
Vinod Iyengar: Thank you, Pratap. With that understanding, let’s look at H2O’s Driverless AI platform, and see how that will play out in an example. Driverless AI is our platform for automatic machine learning. We do have a whole bunch of data connectors here. You can see that there’s a bunch of stuff that’s on-prem, some hybrid, and some cloud. You can bring in data from your HDFS cluster that could be running on-prem, or even on an Azure data lake, for example. You can use either one.
You can bring in data from your local cloud storage, or something like MinIO, as Pratap pointed out. We also have connectors for things like Snowflake, Google BigQuery, and Amazon Redshift. You can bring in data from your cloud data warehouses to us. Once you’ve done the ETL and the data transformation prep, you bring it to what we call a modeling-ready data set. At this point, you have a bunch of features and a target, which is one of many targets that we want to predict. That’s your modeling-ready data set.
Then at this point, you can then run Driverless AI to do a whole bunch of visualizations. This is the first step of your data science workflow. Part of that is done by Driverless AI automatically. We run through a whole bunch of these visualizations to find interesting things, like missing data, or any outliers in your data, and look at the distribution itself. We find out if there are any weird distribution patterns in there. We look at the cross-correlation across all the different features to help figure out if there are any strong correlations that are being seen, even before we start building the model itself.
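The checks listed here (missing data, outliers, cross-correlation) can be illustrated with a few lines of standard-library Python. This is a minimal sketch of the general ideas, not how Driverless AI implements them; the 1.5 × IQR outlier rule, the function names, and the sample columns are all assumptions made for the example.

```python
import math
import statistics


def missing_count(column):
    """Number of missing (None) entries in a column."""
    return sum(1 for value in column if value is None)


def iqr_outliers(column):
    """Values lying more than 1.5 * IQR outside the quartiles (Tukey's rule)."""
    present = [v for v in column if v is not None]
    q1, _median, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical columns for illustration only.
    duration = [10, 12, None, 11, 13, 12, 95]
    cost = [20, 24, 21, 22, 26, None, 190]
    print("missing:", missing_count(duration))
    print("outliers:", iqr_outliers(duration))
    print("correlation:", round(pearson(duration, cost), 3))
```

Findings like a strongly correlated pair of columns or a single extreme value are exactly the kind of thing you would go back and fix in the data set before modeling.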
Based on what you find over here, you can go back and fix your data set, if you will. After that, you run through the actual machine learning pipeline. This is where we automate the whole workflow. Typically, what this entails for data scientists is doing some feature engineering, which is basically transforming the existing features to make them better suited for machine learning algorithms. Two basic examples are one-hot encoding and target encoding, or you can combine a few features to create a higher-order feature that’s better suited for some of these tree-based algorithms or neural networks.
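For readers who have not met the two encodings named here, the following is a minimal sketch of both on a toy categorical column. It shows the textbook definitions only; the function names and data are made up, and the naive target encoding below uses the full column, which leaks the target. Practical implementations compute the means out-of-fold.

```python
from collections import defaultdict


def one_hot(values):
    """One-hot encoding: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return rows, categories


def target_encode(values, targets):
    """Target encoding: replace each category by the mean target for it.

    Naive version for illustration; it uses every row, including the row
    being encoded, so it should not be used as-is for training.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(values, targets):
        sums[value] += target
        counts[value] += 1
    means = {cat: sums[cat] / counts[cat] for cat in counts}
    return [means[value] for value in values]


if __name__ == "__main__":
    channel = ["a", "b", "a", "c"]   # hypothetical lead source
    converted = [1, 0, 0, 1]         # hypothetical binary target
    print(one_hot(channel))
    print(target_encode(channel, converted))
```

One-hot encoding grows one column per category, while target encoding keeps a single numeric column, which is one reason it tends to suit high-cardinality features in tree-based models.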
We do that all automatically for you. This is where some of the art of data science comes in as well, where you might do these feature engineering steps in a way that suits the problem at hand. For example, if you see an i.i.d. problem, then you might do a certain set of transformations. Whereas if you see time series data coming in, you might do things like creating lags, or features that look into the past for detecting micro or macro seasonalities.
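As a concrete picture of "creating lags", here is a minimal sketch that turns a single time-ordered series into rows carrying earlier values as extra features. The function name, the choice of lags, and the sample series are illustrative assumptions, not the transformations Driverless AI actually applies.

```python
def add_lags(series, lags):
    """Build one row per time step with the current value and lagged values.

    A lag of k means "the value k steps earlier"; it is None where the
    series has no history yet.
    """
    rows = []
    for i, value in enumerate(series):
        row = {"y": value}
        for lag in lags:
            row[f"lag_{lag}"] = series[i - lag] if i >= lag else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    daily_calls = [5, 6, 7, 8]   # hypothetical daily call counts
    for row in add_lags(daily_calls, lags=(1, 2)):
        print(row)
```

Longer lags (for example 7 for daily data) are the usual way to let a model see weekly patterns without changing the model itself.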
Those are the things that we could do in Driverless AI, automatically for you. You run through a whole bunch of algorithms to find the right set of models. Then we tune those models as well. This is an iterative process. It’s feature engineering, try the algorithm, tune the model, then see how the performance is. You go back to the drawing board, to do that again and again. In no time, you get a pipeline, which is a highly optimized set of features, algorithms, and the tuning parameters to fix the problem really well.
Once you have that, you have two next steps. The first step is to actually generate the model pipeline. This is what will go into production: your scoring engine or artifact, which is either in Python or Java. In this case, we provide both of those options. You can then deploy this model into a real-time scoring environment with a REST server, or provide it on the cloud, for example, as a Lambda service. You could also do batch scoring with it. This is where you do some of the things that Pratap mentioned earlier, like setting up some elastic load balancing, and putting it in a different place for a cloud environment where scoring could be useful.
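To show what the "Lambda service" option looks like in shape, here is a minimal sketch of a Python handler that scores one row per request. It assumes the request arrives as an API Gateway proxy-style event with a JSON string under `body`; `score_row` is a hypothetical placeholder standing in for the real scoring artifact, not the Driverless AI scoring API.

```python
import json


def score_row(row):
    """Hypothetical placeholder for a real scoring artifact.

    Assumption for illustration only: longer calls score higher, capped at 1.0.
    """
    return min(1.0, row.get("duration_sec", 0) / 600.0)


def lambda_handler(event, context):
    """Handler in the (event, context) form used by AWS Lambda for Python."""
    row = json.loads(event["body"])   # assumes a JSON body on the request
    return {
        "statusCode": 200,
        "body": json.dumps({"score": score_row(row)}),
    }


if __name__ == "__main__":
    # Local smoke test with a hand-built event; no AWS account needed.
    fake_event = {"body": json.dumps({"duration_sec": 120})}
    print(lambda_handler(fake_event, None))
```

Because the handler is stateless and the model is a single small artifact, the same function can be called from a REST endpoint for real-time scoring or looped over a file for batch scoring.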
We also then do machine learning interpretability on the models, which is a good way to help understand how the model actually performs. Meaning, we learn what the top reason codes are for each prediction. We use that to go back and revisit to see if the model is doing the right thing. If you see that there are certain features that are coming out that are significant, which shouldn’t have come out, you can go back and tweak that in the pipeline. So this is extremely useful for diagnosing and addressing bugs in a model, or helping get a feel for the fairness or any bias that might have crept into the model.
That’s a very important point. We do this automatically for you. We generate these interpretations for you. At this point, I’m going to skip into an example of a customer. One of our highly valued customers is called G5. You can go to their website, getg5.com, to learn more about what they do.
At a high level, they are a real estate marketing company. The service they offer is for operators or property managers. Their customers are property managers who manage single family properties, multifamily properties, or senior living facilities. What they are trying to do is manage their advertising campaign. They are running ads on Google or on other social networks, and they’re generating leads out of those ads. Based on those leads, they have to qualify those leads to find out which ones are most optimal. They want to see who is interested in buying or renting the property, and then they’ll pass those leads on to the property managers. This is the use case that we are trying to solve for their existing clients.
The goal for them was to figure out from a call, mail, or touchpoints, if that lead is a hot prospect or not. How do they do that? They’re trying to measure the marketing efficiency, but they also do the lead qualification so that their direct specific clients can actually go after the hot leads very quickly.
So, what that basically translates to is intelligent lead scoring, but they can also tie it back to help optimize their marketing spend. As your leads are coming in, you can optimize and see if your lead scoring is telling you whether the leads are good or not. If they are, then you spend more on the channels that are giving you the better leads, and drop spending on the ones which are not. It’s doing this all in a real-time fashion.
How do they achieve this? They ended up being completely on the cloud. As the calls were coming in, they were being stored as audio files. Then these calls were stored into Amazon S3. They had these call files coming in, and then they were using a transcription service to take the audio file and convert it into text. So they have a text transcript of a call.
Then they were using H2O 3’s word2vec model to generate features. Basically, those are meta features representing the keyword embeddings for the call. These were generated on EC2 with H2O 3 running on it. Once they did that, they also had some metadata about the call: things like the duration of the call, the day of the week, the time of the day, and which area the call came from. That metadata, along with the word2vec features from the actual call transcript itself, went into the model.
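One common way to turn per-word embeddings into a fixed-length feature vector for a whole transcript is to average them, then append the call metadata. The sketch below illustrates that idea with a tiny made-up embedding table; it is not H2O 3's word2vec API, and the words, vectors, and feature layout are assumptions for the example.

```python
# Hypothetical 2-dimensional embeddings; a trained word2vec model would
# supply vectors with many more dimensions.
TOY_EMBEDDINGS = {
    "rent": [1.0, 0.0],
    "tour": [0.5, 0.5],
    "price": [0.0, 1.0],
}
EMBEDDING_DIM = 2


def transcript_vector(text, embeddings=TOY_EMBEDDINGS, dim=EMBEDDING_DIM):
    """Average the embeddings of the known words in a transcript."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim   # no known words: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def call_features(transcript, duration_sec, hour_of_day, day_of_week):
    """Concatenate call metadata with the averaged transcript embedding."""
    metadata = [float(duration_sec), float(hour_of_day), float(day_of_week)]
    return metadata + transcript_vector(transcript)


if __name__ == "__main__":
    print(call_features("What is the rent price", 120, 14, 2))
```

The resulting row is plain numeric data, so it can be fed to any tabular model alongside the other call features.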
Basically, they used EMR to generate a cluster, and then ran H2O 3 on it to do the data munging, using Spark plus H2O. Once that was done, they ran it through H2O Driverless AI to build a model that would basically use all the metadata and the call transcript features to predict which callers were highly likely to buy, or had higher intent of purchase.
So, they were able to efficiently identify caller intent. Once that was done, those models were then deployed using AWS Lambda to provide real-time scoring. It’s a real-time application that takes a phone call, transcribes it, vectorizes it, scores the call transcript plus the metadata with the model that’s built by Driverless AI, and gives you a call score. That call score essentially populates their scorecards, which are used by their end customers to see if the call is hot or not.
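A scoring function of this kind could look roughly like the Python sketch below. The `lambda_handler(event, context)` signature is the shape AWS Lambda expects for Python functions; everything else is an assumption for illustration, including the request body layout, the `score_call` placeholder (which stands in for the real Driverless AI model), and the 0.7 threshold.

```python
import json

HOT_THRESHOLD = 0.7  # hypothetical cut-off for a "hot" call

def score_call(features):
    """Placeholder for the deployed model; returns a score in [0, 1]."""
    duration_part = min(features["duration_sec"] / 300.0, 1.0)
    return 0.5 * duration_part + 0.5 * features["intent_signal"]

def lambda_handler(event, context):
    """Parse the request body, score it, and return a JSON response."""
    features = json.loads(event["body"])
    score = score_call(features)
    return {
        "statusCode": 200,
        "body": json.dumps({"call_score": score, "hot": score >= HOT_THRESHOLD}),
    }

# Local usage: simulate one request without deploying anything.
request = {"body": json.dumps({"duration_sec": 300, "intent_signal": 1.0})}
response = lambda_handler(request, None)
```

Keeping the model call behind a single function makes it easy to swap the placeholder for a real scoring artifact later.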
You’ll see something like this, where you can see that the hot call might be ranked and color coded to show that this person is a hot prospect. What that boils down to is essentially a million dollars plus savings per month, out of the box. The key to remember is that you are able to do this with a staff of just two technical folks. These are basically data engineers and one data scientist on the team. Essentially, a very, very small data science team was able to pull this off in a matter of a couple of months.
There are a few reasons for this. The first is that the entire data science workflow is completely automated using Driverless AI. So they got the power of an expert data science team in a box, essentially, doing all the machine learning work for them. That resulted in a highly accurate model, around 97%. That’s actually better than even having a team of humans evaluating those calls and judging whether the caller intent was good or not. In this case, the model was actually beating their team of humans in a call center, doing this evaluation. More importantly, because they chose an infrastructure completely on cloud, using cloud-native elements, and also using software like H2O 3 and Driverless AI, which were already available on-prem for them, they were able to put this entire infrastructure up in place in a matter of months.
In a couple of months, they did the whole machine learning, plus the infrastructure to set up this application and generate savings. This goes to show that this is not just a theoretical cloud experiment. You can actually pull this off, even for medium-size companies or even large companies. Again, learn more about that particular use case in a previous webinar that we did, with AWS and G5. We’ll put the link in the notes.
At this point, I want to invite Pratap to share some thoughts on what new cloud consumers should remember for any practical applications, or some of the other things we thought about. Pratap, do you want to talk about these?
Pratap Ramamurthy: Absolutely. So let’s say you are a new cloud user, or you’re just creating a new account. You might be an existing cloud user, but you’re trying out something in a new account, for a new project. Here are some things to remember. The first thing is, start small, and then you can go bigger, right? One of the mistakes that I made initially when GPUs first came out was that I really wanted to try them out. I started with the largest instance type, and I was doing something, and got distracted. Then I came back a week later, and found that this instance was still running. You don’t want to make those kinds of mistakes. I’m sure many of us have done that. So start small. You can even use just a CPU-based instance, maybe. Also, try to use auto shutdown. There is a feature called auto shutdown which says that if the CPU is not utilized for more than one hour, the instance is automatically shut down. This is a feature that’s available in the cloud; AWS has it. If that feature is available, use it.
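The auto-shutdown rule is simple enough to sketch. On AWS this kind of rule is typically configured as a CloudWatch alarm with a stop action rather than written by hand; the Python below only illustrates the decision logic, and the 5% threshold and five-minute sampling interval are assumptions.

```python
def should_shut_down(cpu_samples, threshold_pct=5.0, idle_minutes=60, sample_minutes=5):
    """Return True if CPU stayed below threshold_pct for the last idle_minutes.

    cpu_samples: CPU utilisation percentages, oldest first, one per sample_minutes.
    """
    needed = idle_minutes // sample_minutes
    if len(cpu_samples) < needed:
        # Not enough history yet to call the instance idle.
        return False
    return all(sample < threshold_pct for sample in cpu_samples[-needed:])

idle_hour = [2.0] * 12             # a full hour of near-zero CPU
busy_at_end = [2.0] * 11 + [50.0]  # activity in the latest sample
```

Requiring a full window of low readings avoids stopping an instance that is merely between two steps of a job.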
The second thing is, let’s say you’re going beyond just one instance. You want to minimize having an orphaned resource. An orphaned resource is something you start and then you forget to shut it down, and then you move on to the next project, and so on. These are resources that are being billed. The thing is, even if you have more than one person in the team doing this, you would actually forget who started what. There would be lots of resources running that are mostly unnamed.
It’ll be like instance one, instance two, and so on, and nobody would remember who exactly started these instances. Everybody would think it’s somebody else, and you would not want to shut these down. This is a completely wasted resource. You want to avoid having any shocks in your monthly bills.
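One common guard against orphaned resources is a tagging policy plus a periodic audit. The Python sketch below flags instances that lack an owner or project tag, or that have been running longer than a week. The inventory format, the tag names, and the seven-day limit are hypothetical; a real audit would read the inventory from the cloud provider's API.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = ("owner", "project")  # hypothetical tagging policy

def find_orphan_candidates(instances, now, max_age=timedelta(days=7)):
    """Return IDs of instances that lack a required tag or exceed max_age."""
    flagged = []
    for inst in instances:
        missing = [tag for tag in REQUIRED_TAGS if not inst["tags"].get(tag)]
        too_old = now - inst["launched"] > max_age
        if missing or too_old:
            flagged.append(inst["id"])
    return flagged

# Arbitrary reference time and a made-up inventory for the example.
now = datetime(2020, 1, 15, tzinfo=timezone.utc)
inventory = [
    {"id": "instance-1", "tags": {}, "launched": now - timedelta(days=1)},
    {"id": "gpu-trial", "tags": {"owner": "alice", "project": "demo"},
     "launched": now - timedelta(days=9)},
    {"id": "scoring", "tags": {"owner": "bob", "project": "leads"},
     "launched": now - timedelta(hours=3)},
]
orphans = find_orphan_candidates(inventory, now)
```

The untagged instance and the long-running trial are both flagged for review, while the recently launched, properly tagged one is not.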
Let’s talk about the administration of resources and data. You need to have some kind of basic administration. Do not use a root account for your cloud. Always use an IAM user, which has fewer privileges than root, because a root account has all the privileges. It can do everything in the cloud account. Always use an account that has lower privileges. Make sure that the privileges are actually as minimal as possible. Only grant privileges as needed. If you do not need access to resources or services that are completely unrelated, do not grant it. Just have some basic security practices in place. Let’s say you’re starting an instance. For example, on AWS, Driverless AI runs on port number 12345. It’s easy to remember, but the thing is, you only need that port. You do not need all the other ports.
By default, when I create the image, I only open that port; all the other ports are shut down or blocked. Very simple, right? Even that port can be configured to be accessed only from your device. You don’t want to open all your ports. The worst thing is opening up all of your network ports to the entire world. That is inviting trouble.
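As a rough illustration of the port advice, the Python sketch below reviews a list of inbound rules and flags any rule that opens more than the single Driverless AI port or that is reachable from the whole internet. The rule format is a simplified assumption, not a cloud provider's actual schema, and the addresses come from a documentation-only range.

```python
import ipaddress

def risky_rules(rules, allowed_port=12345):
    """Return the inbound rules that are broader than needed."""
    risky = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        open_to_world = network.prefixlen == 0
        extra_ports = (rule["from_port"], rule["to_port"]) != (allowed_port, allowed_port)
        if open_to_world or extra_ports:
            risky.append(rule)
    return risky

rules = [
    {"from_port": 12345, "to_port": 12345, "cidr": "203.0.113.7/32"},  # one device only
    {"from_port": 0, "to_port": 65535, "cidr": "0.0.0.0/0"},           # all ports, everyone
]
flagged = risky_rules(rules)
```

Only the second rule is flagged. Note that even port 12345 would be flagged if it were open to 0.0.0.0/0, matching the advice to restrict access to your own device.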
The fourth thing you want to remember is cost estimates. When you are launching an instance or some resources, there are two pieces to the estimates. One is infrastructure – that is the instance-type, and probably the storage volume that you’re attaching, and so on. That is all part of the infrastructure, and there is also the piece for software fees or licenses. Mostly, they would separate and show you these, and then add and give you the total.
You want to be aware that these two are separate. The fees for infrastructure vary by the size of the instance. The software fees may not be linked to that; they might be orthogonal. You want to be aware of what exactly you’re doing and how it’s going to be billed. This is usually something that’s shown to you before you launch, or at the last step of launching an instance or resource.
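The two-part estimate can be written down as simple arithmetic. In the Python sketch below, every price is a made-up placeholder rather than a real rate; the point is only that infrastructure (instance hours plus storage) and software fees are computed separately and then summed.

```python
def monthly_estimate(infra_per_hour, software_per_hour, hours,
                     storage_gb=0.0, storage_per_gb_month=0.0):
    """Return infrastructure, software, and total cost for one month."""
    infrastructure = infra_per_hour * hours + storage_gb * storage_per_gb_month
    software = software_per_hour * hours
    return {
        "infrastructure": infrastructure,
        "software": software,
        "total": infrastructure + software,
    }

# Hypothetical rates: 8 hours a day for 22 working days, bring-your-own license.
estimate = monthly_estimate(
    infra_per_hour=3.0,
    software_per_hour=0.0,
    hours=8 * 22,
    storage_gb=100,
    storage_per_gb_month=0.10,
)
```

Changing the instance size changes only `infra_per_hour`, which shows why the software fee can be independent of the infrastructure fee.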
The last thing that you want to consider is portability. Let’s say you’re working on a machine learning platform problem, and you’re choosing a cloud. Let’s say this is not finalized. You do not know if this is the cloud you want, or if it will even be in the cloud or in the data center. You really want to make sure that you’re not painting yourself into a corner with the project that you’re doing. You do not want to end up saying, “Oh, it only works on this cloud. It’s not portable across clouds, and you cannot bring it on-prem.”
This would be a project-killing factor that would eventually damage the project’s long-term vision and long-term growth. What you would ideally want to do is choose a machine learning platform that is portable across the clouds and also portable on-prem. Even portable in terms of bringing it to your own laptop. In H2O, we make sure that all of our products are like this.
Driverless AI and H2O can run anywhere. They can run on your laptop, or on-prem on Intel CPUs in your own data center, and you can run the same software on any of the public clouds. You have this portability. So you don’t have to worry about painting yourself into a corner by choosing to do a trial run on one cloud. You don’t have to worry about that when you’re using H2O’s products, because they’re completely portable across clouds, on-prem, and even on your laptop.
Those are some of the things to remember when you are evaluating or just starting up with running a workload on the cloud. Vinod, do you want to take over this slide?
Vinod Iyengar: Sure. If you’re starting with H2O on the cloud, you have quite a few options. All three major clouds (AWS, Microsoft Azure, and Google Cloud) are supported. All three products are available, so if you want to use H2O Core, there’s a VM for H2O Core in all three clouds, in the AWS Marketplace, Azure Marketplace, and Google Cloud Marketplace.
Similarly, Driverless AI is also available on all three clouds, and Sparkling Water is available as an integration with the cloud-native big data infrastructure packages. So for example, on AWS, we have Sparkling Water plus EMR CloudFormation templates that you can use to get started very easily. On Azure, you can use Sparkling Water on Microsoft Azure HDInsight to get started on a big data cluster and do your machine learning on Sparkling Water.
On Google Cloud, you can use Sparkling Water with Cloud Dataproc, which is Google’s own native big data infrastructure solution. As you can see, there are cloud-native solutions that are optimized for each specific cloud.
Finally, here are screenshots of the three marketplaces. You can go ahead and pick it up. In addition, we also have a couple of offerings especially for customers who want to get started on a quick trial. If you want to do a quick trial or a PoC with H2O’s offerings, you can do a couple of things. The first thing you can do is try a trial of H2O on AWS, with an offering called Aquarium.
That’s our training platform, so if you have been to one of the H2O conferences or H2O trainings, you probably have had a chance to play with it. That’s a very quick way to try Driverless AI or H2O 3 on a cloud instance for a couple of hours. We also have something called Puddle, which is available for trials and PoCs. If you just want to try one of our offerings on a cloud-native interface without having to go through your own setup, you can try that as well.
If you are evaluating or running H2O or H2O’s products on the cloud, that’s a quick way to try it out. Contact our sales team or one of our marketing folks, and we can set you up with either of those two. With that, we’ve come to the end of the presentation. I think we have a few minutes for taking a few questions. Let’s look at the questions that are coming up.
There are a couple of questions that have come up, both online and offline. There’s one here – I think we answered the first question already, which is, “Are H2O and Driverless AI available on the cloud?” Yes. They are available. H2O 3, of course, is open source, so there is no license cost for running H2O 3 on any of the clouds. You only have to pay for the infrastructure. If you do want to buy enterprise support for maintenance and getting help through the AI and machine learning journey, we offer that for our H2O and Sparkling Water customers. For Driverless AI, the VM itself is free on all the clouds, but you have to bring your own license. If you got a trial license from us, you can use that on those cloud instances. Eventually, if you do end up buying a commercial license from us, you can use that license as well to run on the public clouds. These are our models. Hopefully, that answers your question.
The second question here is interesting. The question is, “How can I pick the right hardware or instance type for a particular ML workload?” This can be a little tricky, because you need to understand what the consumption patterns are for the ML framework that you pick. For example, for H2O Core, we typically recommend compute and memory size requirements based on the size of the data set. Typically, it’s a multiplier. If you have, say, a 10-gig dataset, we recommend at least 4 to 5X the amount of memory to run that in an efficient fashion.
Of course, compute is also a separate consideration. We provide these sizing guidelines. Most enterprise-grade frameworks will have some kind of a sizing guideline. It’s probably useful to figure out what size dataset you’re going to run, and get those sizing guidelines from the vendor you are choosing. In our case, in H2O’s documentation, we have sizing guidelines for both Driverless AI and H2O 3. You can use those to pick the right hardware instance. That’s the benefit of doing it in the cloud. If you find that your experiment is really slow, you can always shut it down and spin up a bigger instance. That’s tough to do on-prem. You have to make the decision up front on-prem, whereas here you can try things out. It’s cheap to make a mistake. You can shut down the instance in an hour, and then go and start a new one.
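The memory rule of thumb for H2O Core, roughly 4 to 5 times the dataset size, translates into a small selection routine. In the Python sketch below the instance catalog is hypothetical (made-up names and memory sizes), and the default multiplier of 5 takes the upper end of the stated range.

```python
def pick_instance(dataset_gb, catalog, multiplier=5):
    """Return the smallest instance whose memory covers multiplier x dataset size.

    catalog: mapping of instance name to memory in GB.
    Returns None when nothing in the catalog is large enough.
    """
    needed_gb = dataset_gb * multiplier
    candidates = [(mem, name) for name, mem in catalog.items() if mem >= needed_gb]
    if not candidates:
        return None
    return min(candidates)[1]

# Hypothetical catalog; real instance names and sizes differ by cloud.
catalog = {"small": 16, "medium": 64, "large": 256}
choice = pick_instance(10, catalog)  # a 10 GB dataset needs about 50 GB of memory
```

Because the choice is cheap to revisit in the cloud, a first pick that turns out too small can simply be replaced with the next size up.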
Pratap Ramamurthy: You captured all the considerations for sizing the instance during the training phase. I think you had a very interesting scenario with the customer use case, G5, with the deployment phase. After you’re done with the training, and when you’re trying to deploy the MOJO or the model, now you don’t have to have such a large instance type, right? You only need to have a much smaller instance type.
Also, you can even deploy it into a Lambda function or a cloud function, which could bring down the cost even further. The deployment workload is very different because it’s going to be continuously up. So this is going to be a smaller instance, but it’s going to run continuously, whereas training is a bursty load. For training you might want, like you said, five times the memory requirement of the data set, but the deployment can be much smaller.
Vinod Iyengar: Excellent point, Pratap. That’s definitely something to consider as well. Again, find out the spec or design pattern for your application. So, what kind of latency do you need for inferencing? Are you doing batch scoring or real-time scoring? Having a good understanding of what your end user requirements are for inferencing can help you pick the instance type or function type for that. As Pratap mentioned, that can be a much, much cheaper instance, but it’s going to be running all the time. The considerations there are a little different. Having a good understanding about the kind of data you want to bring in, and what kind of model application you’re trying to build, will help you size these.
I think we have time for one last question. Pratap, maybe you can take this one. The question is, “How do you migrate the data from on-prem to the cloud? Are there any best practices around it?”
Pratap Ramamurthy: Oh, yes. Absolutely. Like I mentioned before, if you are trying to run your machine learning workload in the cloud, you want to make sure that the data is in the cloud, right? You cannot run machine learning without the data. This is not reinforcement learning, where it’s just exploring the space. This is building models from data.
So there are a couple of ways to do this. The first thing is that you can move your entire data collection pipeline, from ingestion to processing to storing, into the cloud. That is a very cloud-native way of doing things. If that is the case, I think that is a trivial answer for this. This question is based on the fact that there is some data that’s outside the cloud, right? In that case, there are two things to consider.
One is, you can migrate the entire pipeline to go on the cloud. That includes ingestion, processing, and storage. The second way is to have some kind of an ingestion pipeline on-prem, or whichever part that is on-prem, you stay on-prem. What you can do is have a final output file, or single file, or a bunch of files that can be securely transferred to the cloud. If you guys remember, transferring data into the cloud is free, mostly, because data in is always free; data out is expensive.
That can be done, but the problem in this case is that now you’re taking data from on-prem and transferring it outside of your data center. This is going to raise a lot of red flags with your enterprise security. The cyber security team is going to flag this. This is actually not the best way.
The ideal use case is to build a data lake with boundaries that include your data center as well as the cloud, by having a secure link between your data center and the cloud. For AWS, you can use Direct Connect or a secure VPN connection, whereby now you have a pipe from on-prem to the cloud. If you have this, then the situation is very different: the data can move freely between on-prem and the cloud, or you can even bring it over on demand.
Let’s say that just before you start your machine learning training, you can pull the data from either the cloud data lake, or you could bring it from your on-prem data center. The idea is that the boundaries of your data lake do not stop with your data center; that’s the important thing. Data security is the main thing to consider here. You do not want to transfer data directly into the cloud, which is not the best practice. You want to work with your cyber security team to make sure that is available.
Saurabh Kumar: Very good. Thank you, Pratap.
Vinod Iyengar: I was going to take this question, “Do you guys support REST API and clients for the trainable models?” The short answer is yes. We basically provide you with the artifacts for spinning up your own REST servers. We provide you a file with a wrapper, and you can use that to spin up your REST server.
If you go to our documentation (we’ll send those links), we have a whole bunch of deployment templates published, which include tips on how to spin up your REST server, how to set up some functions, and load balancing.
If you use the latest version of Driverless AI, we also have a one-click deploy to AWS Lambda, and the one-click REST server is coming very soon. It’s available in the release patch, but will be available very soon. Pratap, did you have something else?
Pratap Ramamurthy: Yes, someone is asking if the slides will be ready for viewing later on.
Vinod Iyengar: Yes, so all the slides will be provided, and the recording will be sent to all folks who are registered for that session.
Saurabh Kumar: Thank you, Vinod and Pratap, for taking the time and doing a wonderful presentation. To add to the availability of slides and the recording, they’ll be available at the same URL that you used for the webinar. You can also find that on our BrightTALK channel. Again, thank you everybody. Have a good day.
Vinod Iyengar: Vinod Iyengar comes with over seven years of marketing and data science experience in multiple startups. He brings a strong analytical side and a metrics-driven approach to marketing. When he’s not busy hacking, Vinod loves painting and reading. He’s a huge foodie and will eat anything that doesn’t crawl, swim, or move.
Pratap Ramamurthy: Pratap is a Sr. Principal Solution Architect at H2O.ai. Pratap has a Masters in Computer Science and Electrical Engineering from University of Wisconsin, Madison. Early in his career, as a research scientist he worked on using game theory to solve network congestion. He has authored several papers and holds three patents. Before Pratap joined H2O, he was a Partner Solution Architect in AWS, where he helped create the ML partner ecosystem. At H2O.ai, Pratap helps customers architect the solution on clouds, as well as solve data science problems.