Hello. Good morning, good evening, good afternoon, wherever you are; we have people attending from a variety of timezones. Before we delve into the topic for the day, I'd like to do a quick sound check. If you're able to hear my voice, please type a yes in your questions tab so that we can have some confirmation on the audio. Great, there are a few confirmations that I'm audible. Wonderful. Again, a very warm welcome to everybody joining us today for our webinar titled AI in Healthcare: Ongoing Lessons and Best Practices for Better Outcomes. My name is Saurabh Kumar, and I am the APJ marketing manager here at H2O.ai. I'd love to start off with a quick introduction of our panelists. Leading the discussion today will be Niki, who's joining us all the way from New York. Niki is a customer data scientist at H2O.ai with a passion for data-driven knowledge. Coming from a PhD on the microscopic universe of biomolecules, Niki brings scientific thinking to real-world big data. Niki has experience in healthcare among other sectors and loves to work in interdisciplinary teams.
Next up, we have Sudalai Rajkumar, better known as SRK. He's a world-renowned data scientist and a Kaggle grandmaster. His work in the NLP space has won him many accolades and possibly one of the largest followings in the data science space.
Rohan Rao. He's a machine learning engineer and a Kaggle grandmaster with over six years of experience building data science products in various industries. He's an IIT-Bombay alumnus and is hailed the world over for his wins in various championships. His dream is to make Person of Interest a reality.
Shivam. Shivam is a data scientist at H2O.ai and, again, a Kaggle grandmaster. He is a three-time winner of Kaggle's Data Science for Good competitions and a winner of multiple other AI and data science competitions. Shivam also has extensive cross-industry, hands-on experience in building data science products.
Before I hand it over to Niki, a few housekeeping items. Please feel free to send us your questions throughout the session via slido.com; the joining instructions are right here on my screen: slido.com, #lessons. This presentation is being recorded, and a copy will be distributed shortly after the session is over. Now, without further ado, I'd like to hand it over to Niki to kick off the discussion.
Thank you very much. Good morning to everyone from New York; it's a pleasure to be on this panel. We have held a couple of panels previously as a response to COVID-19, touching upon very different timezones, so I'm very happy to have the opportunity now to talk to our friends and customers in this part of the world. Can I escape? To share my presentation, I have to change the presenter.
Yes, absolutely. I’ll just do it right now.
So, we are… Thank you. So we're here today because of the COVID-19 pandemic, but even more than that, this pandemic has shown that data, data science, and data modeling can be an invaluable ally in our fight against disease, in this case against infectious disease. Nothing explains it better, I think, than this quote that I took from Nature, by the former World Bank and World Health Organization chief Jim Yong Kim, whom I quote: "No one in the field of infectious disease or public health can say they are surprised about the pandemic. What surprised me is just how quickly we gave up on the standard shoe-leather epidemiology." And to be sure, there is no derogatory intention here by any means, I'm sorry.
My apologies. So, with that quote, I wanted to showcase that in this pandemic, data science has shown itself to be an important ally in predicting and in helping the health authorities with managing the pandemic, and the Kaggle grandmasters will talk more about this. Being from an experimental scientific background in computational molecular biology, for me one of the most interesting things with regard to the data analysis and data modeling was the comparison of statistical versus machine learning models in trying to model the disease and the various outcomes relating to it.
And here, of course, I want to point out that I put an asterisk because, as you can see, the divide is not a strict one; in data science we use whichever tools work best. And here I have a chart just for those who are not so familiar with the divisions. So it's interesting for me to see models based on scientific prior knowledge, in most cases causal knowledge, actually doing very well in predicting the disease, and then also seeing various more classical machine learning models, using a variety of data, also being helpful, perhaps in different use cases. And with that, I would like to kick it off and ask our Kaggle grandmasters about their experience in modeling COVID-19 and the types of competitions they participated in.
Okay. Great. Thanks a lot, Niki. That was really helpful. Good morning, good afternoon, or good evening everyone, depending on where you are.
Thanks a lot for taking part in this webinar. Okay, I'll start off from a data science standpoint with some of the things that we have worked on and some of the things that I have personally learned. My screen is visible, right?
Okay. So one of the major things that I learned from this exercise, personally at least, is from the data standpoint. As AI people, we all know how important data is when it comes to any machine learning model. As it's rightly said, data is the new oil, and only if we have good data will we be able to build good AI systems in the first place.
When this COVID pandemic started, one of the things I came across was that there weren't good datasets available. So data itself was a very big problem in the first place. One of the efforts that we at H2O have made is to create a repository for datasets. As you can see here, we have an open-source repository called COVID-19 datasets, where we have collated and put together some of the well-known open-source datasets out there. JHU has done a very wonderful job of collating these datasets at the country level, for affected persons, at a daily level. There is the COVID-19 Open Research Dataset, which has information on all the papers that have come in. There are a huge number of Kaggle datasets available as well, in addition to some data from Our World in Data. So this repo, COVID-19 datasets, will give an idea of the different open-source datasets that are currently available which we can make use of.
But one big learning for me is that this infrastructure was not there already; it was more of an ad-hoc thing, with everyone trying to put all the datasets together as and when the pandemic hit. So one piece of infrastructure that we should develop moving forward is a way to gather the data, both from a country standpoint as well as an organization standpoint. Of course there are a lot of concerns about data privacy and data security, and those need to be taken care of. But only if we have good data in the first place will we be able to build any kind of AI system. So that's one of the main key takeaways for me from this exercise.
After that, using this data, we have also built some tools to explore these datasets. If you're new to H2O, this tool itself is quite new: H2O Q is a product that we have built for building AI apps. Internally, for our clients and for some organizations including non-profit ones, we built some applications to help with COVID. I'll quickly show a couple of them, and our other colleagues will also show a few.
One interesting thing that we saw from this tool is mobility. When something like this pandemic comes in, we need to see how people move from one place to another, so that we can estimate how the disease spread will occur. Here we have a dataset from a third-party data provider showing how people moved from one county to another in the US. This one is specific to the US, but we could build this kind of thing for other countries as well, if the right data is available.
This shows how people have moved over time. This particular dashboard is for New York: this is the outgoing population from New York City to other counties. Generally it is huge, and then there is a sudden reduction around the second week of March, when it fell a lot, basically due to social distancing and other measures that were put in place. This one is with respect to the incoming population, and we can see it in this chart much more clearly. This is how the movements generally were before the pandemic, and at the onset of the pandemic we can see people moving along the highways to multiple different counties from New York, though the number of people moving has gone down.
Once further social distancing measures were in place, the number of people moving came down even further, but we can see that most of the people have settled in different counties. These kinds of views help assess where people have moved from one place to another, which can be helpful for multiple different downstream analyses. So this is one thing.
Apart from mobility, we have also created a lot of dashboards to track the situation. This is one of the dashboards we created for India. There are a lot of good dashboards available, so I don't want to go into the details of this one. But one thing which we, as AI people, can do is run a simulation to understand how the progression will be. So we have also done some modeling to understand what the predictions look like. These predictions are more like simulations, because to do this we also have to provide a lot of information, such as what the growth rate will be. Here, when I created the dashboard, I needed to give the forecast horizon, the growth rate, how the growth rate will decay, and so on.
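To make that simulation idea concrete, here is a minimal sketch of the kind of projection described above (this is not the actual dashboard code; the function name and all parameter values are illustrative): cases are compounded forward each day at a growth rate that itself decays daily, mimicking the effect of social distancing.

```python
# Hypothetical sketch of a case-growth simulation with a decaying growth rate.
def simulate_cases(current_cases, growth_rate, decay, horizon):
    """Project cumulative cases over `horizon` days.

    growth_rate: initial daily multiplicative growth (e.g. 0.20 for +20%/day)
    decay: factor applied to the growth rate each day (e.g. 0.95)
    """
    cases, projections = current_cases, []
    for _ in range(horizon):
        cases *= 1 + growth_rate
        growth_rate *= decay  # measures like social distancing slow growth over time
        projections.append(round(cases))
    return projections

# e.g. 1,000 cases today, 20% daily growth decaying by 5% per day, one week out
print(simulate_cases(1000, 0.20, 0.95, 7))
```

Varying `growth_rate` and `decay` gives the different what-if scenarios the dashboard lets you explore.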
This can be used as a simulation to understand what the predictions will be under various circumstances. Thanks to our colleagues and Kaggle grandmasters Marios, Dmitry, [Philip] and Rohan, who have done a tremendous job in some of the Kaggle forecasting competitions and have given us a lot of ideas about how we can go about predicting some of these things. So, yeah, this is mostly what I wanted to say. On the model building part, I would like to request Rohan to take it forward from here and shed some light on how model building can be done for a pandemic situation like this. Thank you.
Thanks SRK. Hi everyone. This is Rohan here. I work as a data scientist at H2O. I'm not a healthcare expert, but I have worked with some customers in the healthcare domain. And recently, as part of H2O's efforts in helping out with the COVID situation, I spent time understanding how best we could forecast some of the elements of COVID. Kaggle, fortunately, had objective competitions on this particular forecasting problem, so I spent five weeks on that. They ran one competition every week for five weeks, where the challenge was to forecast COVID metrics four weeks ahead. I would like to take you through some aspects of those competitions.
If you think about just forecasting COVID-19, there are a lot of open questions; even before coming to the modeling part, it's very challenging to set up the structure of the problem. While it's primarily a time-series problem, most models and solutions out there would structure it as either time-series or regression. But I think one of the most important questions before even getting to modeling is: what is the most important metric, the most important target variable, that should be predicted? There are a lot of options. You could build a model to predict the number of positive cases, or the number of deaths or fatalities, or the number of beds you may require, and so on. And it is subjective.
It could also vary across geographies. For example, for Singapore the number of positive cases may be more relevant, versus, let's say, another country in Europe that would want to estimate the number of fatalities or deaths. Maybe some other country may want the number of severe cases, to understand how many hospital beds would be required, whereas another country would want to forecast, let's say, the number of testing kits required for COVID-19. So there are a lot of different options and structures that could be used. Specifically, the Kaggle competitions used the number of positive cases and the number of deaths.
Coming to the datasets, as SRK mentioned, it's a very new problem, and a lot of people around the world are aggregating, compiling and sharing these datasets publicly; it's amazing to see the amount of activity and information being made available. Kaggle gives you a platform where you can actually use some of these, feed them into the models, and verify whether they are useful or not. As part of the competitions, we had to forecast these metrics for a lot of different geographies. Across the five weeks, it started with data on the lesser side, so we had to mainly predict at the country level. But by the fourth and fifth weeks, the data became much more granular: we had data at the state level and even at the city level. So with time the data is becoming denser and more information is available, and there is one particular insight from these competitions that I found most useful. Let me just share that.
I hope my screen is visible. This is my solution for the competition. The other important aspect of modeling is: what is the time horizon that is most useful to forecast? Like I said, is it the next three days, the next one week, the next four weeks, and so on? The outputs of the models and the importance of the features differ widely. The graph you see on the screen right now is a model built on predicting just the next day. If you look at some of the features, they are primarily the lag features, and things like the number of days since the first case was observed, since the tenth case was observed, and so on. Whereas if you look at, let's say, Model 14, which forecasts two weeks ahead, its variables are a mix of the lag variables as well as latitude, longitude, continent and population. These are metadata about the geographies that become important when you are forecasting slightly longer term.
And finally, when you are forecasting, let's say, four weeks ahead, if you look at the top three variables, none of them are lag variables. What this signifies is that what is happening today, in terms of, let's say, the cases or the deaths, is not very predictive of four or five weeks ahead. That is the reason this is still quite a hard problem to model. And iteratively, with every passing week and with more data coming in, these models are becoming better and smarter, and hopefully with time we should be able to get a really solid model that can model these epidemics much better.
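The two feature families discussed above, lag features and days-since-threshold features, can be sketched as follows. This is not the competition code; the column names and lag choices are assumptions for illustration.

```python
import pandas as pd

# Illustrative sketch: build lag features (dominant for short horizons) and
# days-since-first/tenth-case features (useful for longer horizons) from a
# per-geography daily cases series.
def make_features(df, lags=(1, 7, 14)):
    df = df.sort_values("date").copy()
    for k in lags:
        df[f"cases_lag_{k}"] = df["cases"].shift(k)  # value k days earlier
    first = df.loc[df["cases"] >= 1, "date"].min()   # date of first case
    tenth = df.loc[df["cases"] >= 10, "date"].min()  # date of tenth case
    df["days_since_first_case"] = (df["date"] - first).dt.days
    df["days_since_tenth_case"] = (df["date"] - tenth).dt.days
    return df

df = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=5),
    "cases": [0, 2, 5, 12, 30],
})
print(make_features(df))
```

For the two-week and four-week horizons, static geography metadata (latitude, longitude, continent, population) would be joined on as additional columns.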
So one aspect is obviously the time horizon, and the other is some of the very interesting insights that come from these models. One of the interesting aspects was the lockdown restrictions and mobility. These are very new, and you don't have much historic data, or data from past epidemics, that can be used to model this current epidemic. That is the reason they are not very powerful predictors. And obviously, the life cycle of this particular epidemic differs significantly across geographies: what we saw in China was very different from what we saw in Italy, which is different from Taiwan, which is now very different from what we are seeing in Singapore and India. So it's also becoming important to model these individually. That's the trade-off between having sufficient data versus breaking the dataset down into these smaller geographies and modeling and optimizing them better.
We submitted a lot of these model outputs and algorithms to the Kaggle competitions, and it was quite great to see that the H2O team was able to finish in the top five in each of the five competitions. That was a very objective validation that the work we are doing and the models we are building are among the top models out there. We have also integrated a lot of those models into our platform and ecosystem.
So while these models are good for forecasting and understanding how the pandemic is moving, the other big aspect is that, finally, it impacts business, and it impacts business in various different forms.
I'll pass the mic to Shivam, who will explain some of the impact that COVID has made on different businesses.
Thanks Rohan, and thanks for sharing very useful insights about the modeling techniques that we have used. As you rightly mentioned, COVID is not just affecting the health industry; there are many second-order effects. Before I talk about some of these COVID-19 effects, I'd like to point out that with all of the techniques Rohan and SRK shared, what we are essentially doing is building our own models to predict and forecast what is likely to happen across different geographies, not just in terms of the number of cases but in terms of other business metrics as well. I would like to highlight one of the use cases that we were working on for an e-commerce company. As we talked about, COVID has different effects on different businesses, be it e-commerce, retail, or even banks and finance.
What is happening, essentially, is that supply chains are disrupted. There are so many logistical issues and inventory issues caused by these COVID cases, and that effect is now clearly visible at the level of individual SKUs, customers, regions, et cetera. Essentially, if as an organization we use the previous traditional forecasting models, we are not likely to get the same results. For instance, the black line in this graph is the actual sales of one SKU for an e-commerce company in Brazil. Their previous forecasting model, the blue line, forecast that sales were not likely to go up during this period. But the actual situation was different: there were COVID cases, which is the red line.
Due to the COVID cases, the actual sales went up, which is the green line. What this essentially means is that there is now a need to incorporate sensing information, which is essentially more real-time information. What we did is create these demand-sensing models, which can capture data in real time and then adjust the previous forecasting models, so that as a business you can understand what the effects are at a daily level, what the effects are due to these COVID cases. There can be other information as well, for example social sentiment, because in this dynamic COVID situation many things can change every day, and one signal can be Twitter sentiment, let's say.
As a company, as an inventory management or supply chain organization, we want to understand how our sales are likely to be affected according to this sensing information. That's why we created these models, which we call demand-sensing models, and which are different from forecasting models. These models essentially try to capture the local, real-time effects; they make adjustments using this real-time information and make the predictions better. And once we have these better models, a business organization can pinpoint the effects they are likely to see. For example, some SKUs might see a positive impact on sales, some a negative impact; some can have as high as a 500% impact due to COVID, and some as low as, say, 10% or so.
So the idea here is that the COVID situation can be very disruptive for some businesses, and demand sensing is one of the ways H2O is addressing these issues. We are building these specific modeling capabilities so that individual businesses can also get useful insights to optimize their inventories, supply chains, et cetera.
Now, one other area that has been very interesting and has come up recently is unsecured lending. The lending and mortgage industries are likely to be heavily affected by the COVID situation. For example, in the US we are seeing a lot of jobless claims and a big spike in the unemployment rate. Now, given these signals, this real-time information, will a company's previous models still tell it whether borrowers are likely to default on their loans, or whether they will be fine and there will be no effect? So there is a need to augment the previous historical loans data. For example, the data I'm showing is historical loans data from an Australian bank, and what we essentially did was augment it with some useful real-time COVID signals. One of the charts that SRK shared was mobility, so we captured that data: what is the percentage change in mobility in a particular region? Some regions have as much as a 41% change, some a 33% change. And then we have COVID signals, like how many COVID cases have happened in a region, cases per population, et cetera.
By looking at these real-time signals, which again are sensing variables, banks and organizations can adjust their previous loan models. What they can then do is compare what the effects are likely to be in different scenarios. As a bank, they can run different simulations and scenarios. For example, in scenario one their expected loss was likely to be about $6,000, but in scenario two it can be slightly lower, $4,800, due to the different simulations they have run. And then they can look at how their loans are likely to be affected due to the increase in COVID cases and the increasing jobless claims. Which regions, which states are likely to have higher default rates?
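The scenario comparison can be sketched with a standard expected-loss formula (exposure × probability of default × loss given default), where each scenario stresses the default probability by a different factor derived from the sensing signals. Everything here, the portfolio, multipliers, and LGD, is an illustrative assumption, not figures from the talk.

```python
# Hypothetical scenario comparison for a loan portfolio. A "scenario" shifts
# each loan's baseline probability of default (PD) by a stress multiplier.
def expected_loss(loans, pd_multiplier=1.0, lgd=0.6):
    """loans: list of (exposure, baseline_probability_of_default) tuples."""
    return sum(exposure * min(base_pd * pd_multiplier, 1.0) * lgd
               for exposure, base_pd in loans)

portfolio = [(10_000, 0.02), (25_000, 0.05), (40_000, 0.01)]
print(round(expected_loss(portfolio, pd_multiplier=1.0)))  # baseline scenario
print(round(expected_loss(portfolio, pd_multiplier=2.5)))  # stressed scenario
```

Comparing the two numbers is exactly the kind of side-by-side scenario view described above, just at toy scale.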
The interesting point to note here is that these are not traditional models; these are more sensing models, which not only take historical information to learn from the data but also capture real-time information about how the COVID situation is changing. These kinds of dashboards, results and insights can be very useful for business owners, banks and even supply chain organizations to understand what optimization steps they can take. Can they focus on particular areas? For example, in the case of these loans, can they focus on a particular group and do something, maybe increase or decrease interest rates, just to ensure that those borrowers do not default?
So that's where COVID is not just affecting healthcare; the effect can be seen across multiple industries, and there is a need, which H2O is directly addressing, to create this more real-time sensing information. And there are numerous industries. For example, in insurance there was a report from McKinsey that said there are about 50 such use cases likely to be affected by COVID. So as data scientists and AI experts, we also need to focus on how we capture things like drift. Is the population we used to train our models earlier different from the population now? The answer is obviously yes, due to COVID. So we may not be able to rely on previous, past models. With all the complex techniques, we also need to keep these nuances in mind while addressing these issues. This is something that I wanted to highlight. I would like to pass it back to Saurabh.
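One common way to quantify the population drift mentioned above is the Population Stability Index (PSI), which compares a feature's distribution at training time with its distribution today. The bucket definitions and threshold below are illustrative, not from the talk.

```python
import math

# Population Stability Index between two binned distributions. A small
# epsilon avoids log(0) for empty buckets. Values near 0 mean no drift;
# in practice, PSI > 0.2 is often read as significant drift.
def psi(expected_fracs, actual_fracs, eps=1e-4):
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected_fracs, actual_fracs))

train_dist = [0.25, 0.50, 0.25]   # e.g. share of customers per income bucket, pre-COVID
now_dist   = [0.40, 0.45, 0.15]   # same buckets today
print(f"PSI = {psi(train_dist, now_dist):.3f}")
```

Running such a check per feature flags which inputs have shifted enough that the pre-COVID model should be retrained or corrected with sensing data.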
Great. Thank you Shivam. We have a number of questions that we'd like to go over, so let's dive right into them. The first question asks: what are the top healthcare challenges currently being addressed by AI?
Yeah. I think I can take that, if that’s okay.
Absolutely. Yes. Go ahead.
Okay. So the top healthcare challenge, I think, is actually COVID-19. This is a healthcare challenge, a societal challenge, a production challenge, and AI is helping with it. This echoes the quote from the former World Bank chief earlier: new techniques and modeling approaches are seriously helping with COVID-19, as the panelists just explained through their work at H2O and in the Kaggle competitions. In the bigger picture, where we were before COVID-19, there are a few fields where healthcare was already utilizing AI. And here, with healthcare, we have to remember it is not only the bedside, the diagnostic side, which of course I do want to talk a lot about; it's an important field and we need more AI in that area.
But we have the whole healthcare ecosystem: insurance claims, ordering, pharmacies filling orders, and then of course even more specialized areas. We have personalized medicine, which is a little further away, but this is a great goal to achieve with AI. We have robot-assisted surgery, where mechanical engineering together with data engineering is working miracles, as we have all seen. And with telehealth, which we all became more familiar with during COVID-19, robot-assisted surgery could definitely enhance the experience. Of course, there are others. Clinical trial participation is an important one, because of how hard and time-consuming it is to recruit for and perform clinical trials. Drug development, which perhaps we'll have more questions about later. But I just wanted to give an outline of the areas. Missed appointments, a huge use case there. So, yeah, there are definitely a lot of problems in healthcare that are already being tackled with AI, could use more of it, or where the opportunities are just waiting. Thank you.
Thank you Niki. Before we go on to the next question, I’d just like to point out that we’re taking questions on slido.com. If our attendees can go to slido.com and use #lessons, you’ll find all the questions that we have received so far and they’re bucketed by themes. If, of course, you have more questions you can enter them there. You also have the option to upvote a particular question.
Now, let's get the next question up in front of our panelists. How advanced is AI; could it overtake a physical doctor, or help augment them in making better decisions? This question is around trust in AI models. Would you like to take a stab at it, Rohan?
Sure. If the expectation is for AI to completely replace a doctor, to see a world where there are no doctors, only AI models and engines doing everything, I think we're quite far away from that. But AI is adding a lot of value to the entire healthcare process in a lot of different ways. As a data scientist, the way I see it, it adds ammunition to a doctor's decision-making. For example, a few years back, whether for prescription, diagnosis, or maybe even policy-making, the healthcare professionals involved would look at a few data points or a few aspects to make those decisions. But AI has opened a path where you can see many more kinds of outputs and data points, and some of these results now reach the professional at a much faster rate.
For example, you could have a scanned image or an X-ray image and get 50 or 100 data points, and the model could pinpoint the most important aspects or the outliers in the data; using that, the doctor can make better decisions. One of the biggest challenges, and maybe Niki would know better, the way I see it, is accountability. I don't think humans are really ready to have a completely automated system be accountable for all decisions in the medical domain. So while I think it adds a lot of value and a helping hand, I think we are still some time away from a complete replacement.
Also, Rohan, since you mentioned me, let me add that it's exactly as you say. Let me also add my opinion: it is not even necessary to have complete independence of AI from the physician. At this point, AI can work in typical AI fashion, facilitating workflows; perhaps the AI is helping with a certain diagnosis, or monitoring a patient for predictive purposes, as with the example of sepsis. A goal could also be to screen out the patients at low risk for the disease, so the physician can focus on the patients who need them more. Obviously, the goal is to have independent AI, but I'm going to say it's not necessary at the initial stages, and this is the approach that seems to be taken by the diagnostic AI currently approved by the FDA.
Thanks, Rohan and Niki; those are very valuable points. I'll second both of them in saying that complete AI-driven [inaudible] is very far away, at least that's what I think. But there are some really good examples where AI can help in a lot of places. Rohan mentioned the image problem. Since I'm working on the NLP side: with respect to NLP, there are a lot of research papers that have been coming out recently on the pandemic, and AI systems are helping researchers as well as healthcare experts go through all these research papers to come up with a faster solution. For example, drug discovery used to take a lot of time because of the many mundane processes involved, but AI is helping accelerate the process. So human-in-the-loop AI is what I think will be there in the healthcare system, and a completely AI-driven healthcare system is very far away, as Niki and Rohan mentioned.
And also, healthcare is a very sensitive area; it is directly connected to patients' health. So essentially there is a need for real doctors, but AI can definitely help them get more insights, maybe more data-driven insights, which they can blend with their knowledge and their years of experience to make more informed decisions. And of course there are operational and non-operational sides in healthcare as well. If there are a lot of patients coming in, then, as Niki mentioned, maybe one initial filter can be developed using AI which can rank the individuals so that doctors can prioritize. They're not ignoring anyone; they're just prioritizing better. So these types of insights can be helpful for real doctors.
Thank you everybody. The next question asks: how can AI technology meet the need for contact tracing while protecting personal privacy, both during the crisis and thereafter?
I would like to take a stab at this question. Contact tracing is one of the most important tasks in the current crisis, and some countries, including Singapore, have been very focused on it. But this question is very interesting because one of the prime doubts every individual has is: are these contact tracing applications also incorporating our personal data, and can that data be used in some other manner? So, firstly, we have to look at it from a different perspective. Essentially, contact tracing can be thought of as a network problem in which every individual is a node and the goal is to connect these nodes via some idea, some intuition.
The way some countries connect these nodes differs. Singapore, for one, does it by identifying whether two people are close enough; if they are within 10 meters, an edge exists between them. In other cases it can be whether the two people are family members, or whether they have met in the past seven or ten days. The idea is that while building this network we are not capturing personal information but rather the behavior of those nodes, those individuals. Every node can be thought of as an anonymous unique identifier that points to an individual but carries no other information. That’s why these contact tracing apps need to be driven by governments: they are the ones who maintain the government databases, so they can work at the unique-ID level, build these networks, and whenever they identify a node that is likely to be affected, a suspicious node, they can dig deeper and contact the individual directly by phone, maintaining that individual’s privacy.
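[Editor’s note: the network framing Shivam describes can be sketched in a few lines. This is a minimal illustration, not any country’s actual system; the IDs, distance rule, and two-hop risk rule are all hypothetical. Only opaque node IDs enter the graph, and re-identification would happen separately, only for flagged nodes.]

```python
from collections import deque

def add_contact(graph, a, b):
    """Record a proximity 'edge' between two anonymous node IDs."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def at_risk(graph, infected_id, max_hops=2):
    """Breadth-first search: anonymous IDs within max_hops of an infected node."""
    seen = {infected_id}
    frontier = deque([(infected_id, 0)])
    risky = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                risky.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return risky

contacts = {}
add_contact(contacts, "u1", "u2")   # two devices within the distance threshold
add_contact(contacts, "u2", "u3")
add_contact(contacts, "u3", "u4")

# u2 is a direct contact of u1, u3 a second-order one; u4 is beyond 2 hops
print(at_risk(contacts, "u1", max_hops=2))
```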
Since the question also talks about data privacy: Shivam has given very valuable points about contact tracing apps and privacy, and as he mentioned, our governments are taking care of those issues. But looking at the whole infrastructure as we move forward, data privacy and data security in general are becoming a very big thing, and we need to build infrastructure with good processes around both, because that will be inevitable for coping with any future healthcare epidemics or pandemics. That’s one thing I wanted to add.
Thank you. Our next question asks: how are drug databases searched, and how are appropriate drug matches found as tentative cures for a particular disease using AI?
Okay, great, that’s a good question. From an NLP standpoint, even during COVID there have been a lot of tools and datasets out there that let us do some of these things. NLP has grown a lot in the last two years, and especially in the healthcare space there are some good innovations like BioBERT. BERT is the recent state-of-the-art approach in NLP, and models like BioBERT can be used to search a vast drug database to come up with the best candidates. There are also tools like spaCy with which we can do named entity recognition, or we can build things with our own tools: look at the drug database and get good results. As for what already exists to start with, I’ve seen a couple of things. Let me share my screen if that is okay.
Yes, absolutely. Just give me one second.
Yeah. Yes. Great. Thanks. As you can see, this is the Open Research Dataset put together by Allen AI, which contains the full text of a lot of research papers. If you want to do exploration, this is the CORD-19 explorer created by Allen AI, where you can key in related terms, search for particular drugs, and see what comes up. So this is one thing that is readily available.
But if we want to build something for our own dataset, for our own purposes, there is BioMedSanity.com, built by Andrej Karpathy, a well-known figure in the ML world. He has built a very simple system that could be enhanced into something much more complex. The code is also available on GitHub, so you can have a look at it and see how it could help search the databases and get relevant information from them. Thank you.
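[Editor’s note: the kind of similarity search SRK describes can be sketched in miniature. The toy drug “database” below is entirely hypothetical, and raw word counts stand in for the embeddings a real system would get from a model such as BioBERT; this only illustrates the ranking-by-cosine-similarity idea.]

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a lowercase-tokenised string."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(database, query, top_k=2):
    """Rank drug descriptions by similarity to the query text."""
    qv = vectorize(query)
    scored = [(cosine(qv, vectorize(desc)), name) for name, desc in database.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top_k] if score > 0]

# Hypothetical toy entries; a real pipeline would index CORD-19 abstracts
drugs = {
    "remdesivir": "antiviral inhibits rna polymerase of coronaviruses",
    "ibuprofen": "nonsteroidal anti inflammatory pain relief",
    "lopinavir": "protease inhibitor antiviral used against hiv",
}
print(search(drugs, "antiviral inhibits coronaviruses"))
```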
So basically the goal is to… Niki, let me just add one point. What SRK shared is very, very useful. It’s a gold mine of all the research papers and journals produced in the past, which also describe the effects of previous flus; not pandemics exactly, but disease-like situations. The goal is to use these contextual similarity models to identify a similar situation or scenario, one that is maybe 10 or 20 percent aligned with what COVID-19 is, and then, by blending all the patterns obtained from different journals and papers, come up with possible candidates that can help in the drug discovery process.
Thank you. I will only add very quickly that this is a very important use case of AI for drug discovery: how to actually mine the literature. The COVID-19 datasets that our Kaggle grandmasters are working with in the Kaggle competition are a very important example. I just want to go one step back and say that, together with all of these, there are many methods we can leverage in machine learning for drug discovery, relying similarly on known information, a prior research base of known structures. The idea is that with the drug you are going to attack part of the virus and [inaudible] it, so the virus cannot attack. We have an idea of how these interactions work in different viruses, so we try to mine this historical data and the characteristics of the chemicals in the databases you mentioned. All I want to say is that together with the NLP methods, which hold great promise for de novo discovery, we also have classification, regression, all these approaches that can lead to drug discovery in general.
Thank you Niki. The next question asks: how can densely populated countries like India adopt AI in the healthcare industry?
Let me take a shot at that, being from India and currently experiencing how COVID is impacting us. I think India is steadily adopting a lot of the different technologies and solutions out there. To answer that quickly in one statement, I’d like to share a personal experience. Just this week I ordered some food to be delivered to my house, and the app showed me the hygiene score of the restaurant, the hygiene score of the area surrounding it, the freshness of the vegetables used for the dish, and the real-time temperature of the delivery executive bringing the food. All of this is only possible if the AI system goes really deep into the solutions, the apps, whatever the use case is. It has now started reaching and touching the end users, the customers, people living in different parts of the country. So while there is a large population and, yes, there are challenges in getting this live, I think there is a lot of progress happening.
Great, thank you Rohan. The next question targets a particular use case: how can we leverage AI to estimate the number of nurses required in a medical unit year-on-year, for a particular year or over a number of years?
For this type of use case, the idea is that we want to predict in a sequential manner what is likely to happen next month, next quarter or next year, and the results might be very different for different geographies and regions. So the goal should be to create hybrid AI and machine learning pipelines: typical time-series forecasting models combined with demand sensing models that incorporate real-time information about different geographies along with their past data. Also, since this is an estimation problem, it cannot be just one number that we give; it needs to be optimized with respect to constraints. Every region, every hospital and every location may have different constraints: there may be government regulations, there may be hospital-level constraints.
Once we have the outputs of these forecasting and demand sensing models, we should run some optimization on top, which can be simple linear optimization or more complex integer optimization. And because the COVID situation is something new, we don’t know what is likely to happen even after five years, so we also need to run scenario-based modeling, producing results for different scenarios. By building these hybrid models, AI organizations and healthcare organizations can get a sense of, say, the nurses required year-on-year for different geographies, but with a certain confidence level, because all of these models will carry some noise given the lack of a long history of data. So one can predict, but with a stated noise level and confidence.
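[Editor’s note: the forecast-then-constrain-then-scenario pipeline Shivam outlines can be sketched as follows. The historical counts, staffing bounds, and scenario multipliers are all made up for illustration; a real system would use proper forecasting and demand sensing models, plus linear or integer optimization, rather than this trend-plus-clipping toy.]

```python
def linear_trend_forecast(history, horizon):
    """Fit a least-squares line to yearly counts and extrapolate."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) / \
            sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + h) for h in range(horizon)]

def constrained_staffing(forecast, min_staff, max_staff):
    """Clip raw demand estimates to region-specific staffing constraints."""
    return [max(min_staff, min(max_staff, round(f))) for f in forecast]

nurses_per_year = [120, 128, 135, 150]  # hypothetical historical counts
for label, multiplier in [("baseline", 1.0), ("surge", 1.3), ("mild", 0.9)]:
    raw = [f * multiplier for f in linear_trend_forecast(nurses_per_year, 3)]
    print(label, constrained_staffing(raw, min_staff=100, max_staff=180))
```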
Thank you Shivam. We have time for one more question before the concluding one. This gentleman asks: how do we identify the features to use when changing models to make them suitable for predictions under the effect of the pandemic? In other words, how can models adapt to new data, data that has not been available for the last century? How can we quickly adapt and pivot?
Okay, yeah, I’ll go first and let the others add on top of that. The very first thing is to understand the drift in the data as well as in the model: to see how much the data has drifted due to this kind of pandemic situation. There will be a few areas that are heavily affected and a few not so much. So first understand the drift, and then, if a drift exists, see whether, as Shivam also showed in one of the demand sensing apps, we can find any second-order interactions to use in addition to the existing features that could help estimate the future model predictions. For example, the number of COVID cases can serve as a proxy, if it helps explain that extra dip we see in the model predictions, or we could use mobility information in addition to the existing features.
Essentially we need to see whether there are alternative adjustments we can make to the features to overcome the drift. One more thing would be to see if we should rebuild the models more often: if we were building models only once or twice a year previously and, in this situation, we’re not able to find other features, we might have to rebuild much more frequently. That’s another thing that comes to the top of my mind. So: understand the drift, and take care of it through new features or more frequent retraining. I will let Rohan and Shivam add if they have some points.
Yeah, just to quickly add to that from my experience working with these datasets: one very important and useful method is to break down the data across geographies to test features. For example, there was a new mobility-based feature I was trying out; it seemed to work for seven countries and didn’t seem to work for other countries and geographies. That helps in understanding which feature you can use with which geography, and it does make a significant difference across countries and geographies.
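[Editor’s note: Rohan’s per-geography feature check can be sketched as a simple screen. The data here is hypothetical, and Pearson correlation with the target stands in for whatever validation metric a real workflow would use per country.]

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def useful_geographies(data, threshold=0.5):
    """Keep the candidate feature only where it correlates with the target."""
    return {geo: round(pearson(f, t), 2)
            for geo, (f, t) in data.items()
            if abs(pearson(f, t)) >= threshold}

# Hypothetical mobility feature vs. case counts, per country
per_country = {
    "A": ([1, 2, 3, 4], [10, 21, 29, 41]),   # mobility tracks cases closely
    "B": ([1, 2, 3, 4], [30, 12, 33, 15]),   # no usable relationship
}
print(useful_geographies(per_country))  # the feature is kept only for country A
```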
I also want to add that some artifacts from machine learning interpretability can be used to identify which features are likely to be useful signals for the next models. We can either retrain or refit. If we retrain the same models and see a difference in, say, the feature importances or the Shapley values, that gives hints as to which features need more focus and which need to be tweaked. And in the case of refitting, if we see new features coming up among the most important ones, that gives a sense of which features we need to focus on more.
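[Editor’s note: comparing importances across fits, as Shivam suggests, can be sketched like this. The feature names and importance numbers are invented; in practice the inputs would be, say, gain-based importances or mean absolute SHAP values from two model versions.]

```python
def importance_shift(old, new, top_k=3):
    """Compare normalised feature importances from two model fits and
    flag the features whose share of total importance changed the most."""
    def normalise(imp):
        total = sum(imp.values())
        return {f: v / total for f, v in imp.items()}
    o, n = normalise(old), normalise(new)
    deltas = {f: n.get(f, 0.0) - o.get(f, 0.0) for f in set(o) | set(n)}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]

# Hypothetical importances before and after refitting on pandemic-era data
pre_pandemic  = {"price": 40, "seasonality": 35, "promo": 25}
post_pandemic = {"price": 20, "seasonality": 10, "promo": 20, "mobility": 50}
for feature, delta in importance_shift(pre_pandemic, post_pandemic):
    print(feature, round(delta, 2))  # mobility jumps in; seasonality collapses
```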
Okay, we are at the top of the hour, but I’ll take one quick closing question: what is the future of machine learning in a post-COVID world? One quick statement from each of our panelists, please.
Perhaps I can start, since I started the discussion. Thank you. The future of machine learning after COVID-19, I think, is good. If anything, digital transformation has simply been accelerated wherever it was not already complete. Because of COVID-19 we’re now attending this webinar via our computers, we’re doing our work via computers, and we are using models to help all the countries, to help everyone, against the pandemic. So I think the future is good for AI.
So, yeah, thanks Niki. From a healthcare standpoint, AI will become inevitable after the pandemic, and specifically I feel there will be more focus on data privacy and security, and on one other very important point Shivam touched on towards the end: explainability. Explainability will play a major role post-COVID in helping physicians and researchers. Yeah, thanks.
And AI and machine learning are going to play a very important role, be it in healthcare use cases such as contact tracing and forecasting nurses, or in industry-specific use cases that help stakeholders understand some of the effects they are likely to see. The comparison almost everyone makes for COVID-19 is the 1918 Spanish Flu, and one difference is that this level of AI was not available back then. So hopefully this time AI can show a better impact, because the capabilities are completely different, and we may see different outcomes in the coming years.
Yeah, and just to sum up all these points: when we look at the future, and hopefully very soon start living in the post-COVID era, I think most of the learnings, and the ways we as a world together will be better prepared to tackle these kinds of situations, are going to be outputs of the ML models and of the research and analysis being done today.
Great. Thank you everybody for taking the time today on our panel and sharing your insights and experience with us. I’d also like to say thank you to everyone who joined us today. The panel’s recording will be sent in an email shortly. Have a great rest of your day. Thank you.