March 30th, 2020
Modelling Currently Infected Cases of COVID-19 Using H2O Driverless AIRSS Share Category: AI4Good, Driverless AI, Explainable AI, GLM, Healthcare, Machine Learning, Machine Learning Interpretability, Responsible AI, Technical, Time Series
By: Marios Michailidis
In response to the wake of the pandemic called COVID-19, H2O.ai organized a panel discussion to cover AI in healthcare, and some best practices to put in place in order to achieve better outcomes. The attendees had many questions that we did not have the time to cover thoroughly throughout the course of that 1-hour discussion. We hope that with a series of blogs covering this very important topic, we will be able to shed more light into the different elements the attendees were most interested in. Starting with a blog that explains some of the freely-available data sources we used to create a beds and ICU forecasting application for different US states. Another question that was brought up was whether H2O.ai has created an epidemiological model of COVID-19 and whether we can share more information about it.
The purpose of this blog is to share our experience modelling the currently infected cases for different countries and US states. Before we get into the specifics, we need to highlight that there are a couple of elements that make this type of modelling particularly challenging:
- Any errors made are augmented from the fact that this is a very sensitive topic. Generally errors in healthcare predictions could result in more than monetary losses. Therefore elements like interpretability, domain knowledge and responsible AI become increasingly important (or better, necessary) for this type of modelling. It is not just about reducing an error (or maximizing accuracy) in whatever validation/test strategy somebody might have.
- The available data sources are not perfect. This is not unexpected, most of the times the data scientist has to clean and prepare the data and here is no different. Aside from filling in missing values and/or removing extreme values, you also have to correct for wrong reporting and things like the number of total/cumulative deceased people decreasing(!), table entries/countries changing names over time, unexpected change in trends just to name a few.
- The data is not much. By this, I mean that there are still many details that are not known for this virus, because this is the first time we are encountering it. For example, what is the probability of resurgence or what will be the exact impact of different strategies applied (if eventually applied) to mitigate the spreading or when will the vaccine be created and what impact will it have. In most cases, we just have 30-40 historical data points to predict the next 2-3 months where a lot can happen in between (i.e a vaccine might be created).
In light of the above, we have modelled primarily short forecast horizons between 3 to 7 days for the currently infected COVID-19 cases at country level (with the exception of a few countries that also have a province/state breakdown as well). Another thing to highlight is that we will share what has worked in our modelling based on how the problem was framed. We do not claim this is the only way to frame/model this problem. Many other approaches could be tried as well. The following sections will outline the data sources used, the data preparation, how H2O Driverless AI was set up, the expert settings used, the modeling approaches considered and presentation.
Data Source and Preparation
Like many other data scientists tackling this problem right now, the main source for the modelling has been the data repository of theJohns Hopkins University Center for Systems Science and Engineering. It contains information about confirmed, deceased and recovered cases at country level as well as some basic geographical features (eg. latitude and longitude). It also contains some of this information (excluding recoveries) for state level and even for county level for the US. Something that has proven somewhat useful was to create an additional feature which is just the cluster id after fitting Kmeans clustering on latitude and longitude. The reason why the clustering works is because some areas behave similarly (for instance many provinces in and around China). The exact parameters were the following:
from sklearn.cluster import KMeans
clusterer = KMeans(n_clusters=30, init=’k-means++’, n_init=10, max_iter=300, tol=0.0001, precompute_distances=’auto’, verbose=0, random_state=1234, copy_x=True, n_jobs=None, algorithm=’auto’)
X[“Distance_cluster”]=clusterer.fit_predict(X[ [“Lat”, “Long”] ].values)
The data was brought into the following format:
Where currently Infected= Confirmed – deaths – recovered (all cumulative).
It is imperative to clean this data, on many occasions the same value is reported between dates (which is unlikely) or even the (cumulative) number drops. For the purposes of making a more robust model:
- When any of the Confirmed, Deaths drop or increase-and-then-immediately-drop, they are replaced with the same value as previous.
- Example of case 1:
- Example of case 2:
- Example of case 1:
- When the value remains near the same and then “explodes,” the value is replaced with a linear interpolation (e.g., the average between the previous value and the next value).
The data would be transformed from this:
The assumption behind these corrections is that the reporting is wrong or non-updated for specific records as you would expect these fields to be increasingly monotonic. Making those corrections improves the fitting of the models.
The same treatment is applied to DAI predictions as post-processing. There were cases that due to fairly flat lines from the input data, they caused a similar pattern to the predictions (where for example 2 of the predictions move high and the middle one stays low), hence replacing those cases with a linear interpolation improves the forecast.
What is H2O Driverless AI?
For those who might not know this, H2O Driverless AI< (abbreviated as DAI) is an artificial intelligence (AI) platform for automatic machine learning. It can handle many types of data science problems (like classification and regression), including time series. For a quick tutorial on time series, consider the following.
The time series recipe of DAI has been designed to handle multi(ple)-series problems. That is one can provide multiple single series that are separated by a key/group or multiple keys/groups (like the countries and/or the geographical clusters in this case) and a date field. Features like lags, moving averages, exponential smoothing or single-series models (like ARIMA) can be automatically generated for any given series and any point in time. Then models are fit on the whole data (looking at all series at the same time).
For example, taking a simplified scenario where we model the confirmed cases with only a date and a group column(country), lag of 1 and lag of 2 of confirmed cases would be calculated for each individual point in each one of the series like you see below:
Then the “new frame” is actually the one being used to predict the target (i.e confirmed). A list of all possible transformations including the time series’ related ones can be seen here. Nan values are handled based on the model that is selected to fit on these values, which could be a generalized linear model, a random forest, a gradient boosting machine, a deep learning model or other. Many different models will be tried automatically.The multi-series approach has worked better in our internal tests compared to single series’ approaches – albeit we did not spend too much time with the latter nor did we tune them too much. It seems that the model can benefit from being able to ”see” series which might be in completely different phases. Then it has “knowledge” of series that have gone up and down, otherwise single series’ models like or Prophet had the tendency to remain monotonic throughout the whole forecast horizon and were failing for instance to capture South Korea’s recovery shift until it was well into its downward trajectory.
A last component to mention is that DAI will automatically “draw” multiple different lines, utilizing many different lag combinations and lengths in order to find a good set of auto-regressive (and other) features to lower the error. There is a genetic algorithm that controls this process that progressively tries to find smarter features and better models to achieve lower errors. In a times series’ context, the decisions are evaluated through a rolling window validation format (where models fit with older data and are being tested with future data). The number of validation splits, type of models evaluated, the time spent on models’ hyper parameter tuning, feature engineering (and searching) is controlled by the parameters Accuracy, Time and Interpretability. Setting a higher Interpretability means putting higher intensity into making the model more explainable (potentially at the loss of some accuracy) via limiting interactions between features, prioritizing simpler models and more:
H2O Driverless AI specific settings
Typical Time Series Settings
DAI allows the user to provide their own recipe through python code (with an API that resembles ’s), therefore on top of the general time series component, the user can add his/her own domain knowledge through models, objective functions(scores) and feature transformations. There is an open source repository with a sample of these recipes here. In addition DAI provides some EXPERT SETTINGS that give more control to the user to set more refined or use-case specific settings in order to get better results (faster). This is what the experiment setup looks like:
- The name of the training file.
- Name of the target variable – what we want to predict. Remember Infected= Confirmed – Deaths – Recovered.
- The date field. It is required to trigger the time series recipe.
- The time group columns. These help define the unique times which are in the data. although Distance_cluster does not further separate the series, it can help in different ways which will be explained later on.
- This is the forecast horizon. For this experiment it is 3 days (e.g., predicting 3 days ahead).
- Features that are unavailable at future time. These will be Confirmed, Deaths and Recovered. By including these, we tell DAI that it cannot use them even if present in the test data, but it is perfectly fine to create time series features from them (like lags of confirmed).
- This is the metric to optimise. Normally you could see RMSE, RMSLE, MAE or another popular regression metric (check our guide for when it is best to use each metric). However, when using RMSE, I have found it useful to penalize underestimation more than overestimation
This is an early model evaluated on Standard RMSE for Spain:
And here the model with the modified metric to account for underestimation :
The overall metric monitored in the end was MAE. We prefered to model on actual counts and not in log space. While log space can help scale the series to more similar levels, making it more interesting from a modelling point of view, a life is a life and we chose not to take away the intensity from states/countries with more actual cases. Having said that, setting the metric to RMSLE would automatically transform the target to log(Infected+1) and model in that space.
Further areas of Improvement
Before sharing what has worked well for this problem, it might be useful to mention a couple of things that may have potential, but either could not make them work due to limited time or did not have the time to try them at all within the context of this DAI experiment.
AUTO-ARIMA and Prophet recipes
Through the custom recipe functionality, one can add additional models, features or scorers/objectives. For many of our time series problems these two recipes provide good additional value and as single-series models, blend well with our multi-time-series approach. In this particular case, adding them on top of the automated feature engineering that occurs naturally within DAI, they did not provide extra value. Some of the best parameters found (given the selected forecast horizon) for the auto-arima were converging towards these values:
stepwise_model = auto_arima(datafr, start_p=1, start_q=1,
max_p=3, max_q=3, m=6,
d=1, D=1, trace=True,
Features that are unavailable at future time
Although Confirmed, Deaths and Recovered cases were added as potential fields to derive lags (or other features) from them, the genetic algorithm did not find them useful and the final set of features does not contain any that come from them. DAI gives priority to features derived from the target variable, so potentially changing that priority from the EXPERT SETTINGS→ Time Series, would give them a higher chance to be trialled, but based on the default values they did not seem to add new information in predicting the currently Infected better.
The SEIR Model
Domain knowledge is essential to tackle this problem better. Our team has worked tirelessly to bring in algorithms that are specifically designed to tackle epidemiological modelling. We have very recently added the SEIR model. The SEIR model belongs to a family of epidemiological models (including SIR, SEIS, MSEIR) that maps the spread of an epidemic through the sequential interaction of 4 groups (represented as 4 ordinary differential equations), the Susceptible (or number of individuals that can contract the disease), Exposed, Infected and Removed. Given an initial population of size N = S+E+I+R, a forecast period and a set of (4) fitting parameters that control the rate at which individuals jump from one stage to another and a mortality rate one can simulate the shape/size of these 4 groups across the forecast horizon. For more information about the SEIR model, there is an excellent interactive simulation here (and another interesting article analysing social distancing through SEIR).
A typical graph (retrieved from here) mapping the 4 groups would look like:
The very intriguing part about this representation/model is that it does not require historical data (in the time series sense) to make predictions, but an understanding of the dynamics of the epidemic to spread through the flow of these 4 groups. Therefore, with an understanding of the potential size of the population that could be affected, the infection rate and the mortality rate one could estimate the number of Infected individuals as a function of time. However, finding these numbers accurately in different geographical areas is not easy and domain knowledge from epidemiologists may be required to ensure the best fit. By default, DAI will assume some reasonable but generic bounds for each of the fitting parameters, and find the best fit among all possible curves. The user is encouraged to provide narrow bounds for each of the fitting parameters to control the epidemiological part of the model.
Within DAI the SEIR model can be activated with the press of a button from within the EXPERT SETTINGS and is used as a detrending mechanism:
Detrending in Driverless AI is a family of transformations that apply to the target variable in order to remove the trend and help the models extrapolate better by separating the trend from the residuals, and letting the inner machine learning model focus solely on the residuals. Other examples of trend models are linear trends or logistic trends. Note that in DAI, for large datasets of multiple time series, a separate detrending model (such as SEIR) is fitted on each time series automatically. After finding the SEIR model with the lowest error on the training data, we then calculate the residual (e.g., the Infected minus the predicted/projected from the SEIR model).
Consequently, we use the residuals from the SEIR model as the new target for the machine learning model such as GLM, NN, RF or GBM. The logic that selects best features, models and hyper parameters remains the same from here on – the only difference is that decisions are based on reducing the error in predicting the transformed target (i.d., the residuals) rather than the original one (trend + residual). The inverse transformation is applied to the validation and test data to get the final predictions (i.e., adding back the SEIR model’s prediction to the modeled residuals).
This approach showed promising results. However, we decided not to include it because it is not available in the current version (but it will get released soon). To get an idea of what is happening under the hood, these are different projections of Infected for California assuming different epidemiologic parameters such as population size N or rates of exposure, recovery and of getting infected:
This highlights the need for expert control of the fitting parameters through as narrow bounds as possible.
GLM is all you need
Allowing Driverless AI to loop through all models is a waste of time in this case, GLM is eventually giving the best results, possibly for its ability to extrapolate on new territories of the target distribution. Feature engineering is more important than the underlying model. From the EXPERT SETTINGS — MODEL, all other models but GLM are disabled:
Enable lags for all intermidiate groups
From the EXPERT SETTINGS — TIME SERIES disable the following:
When this is enabled, DAI will only consider lags from the combination of all groups, in this case, key and Distance_cluster. When this is disabled (like above), DAI will search for lags for any of the intermediate groups (for example, it could search for moving averages across all keys from within a Distance_cluster).
Enable smallest lag possible
By default, DAI has a natural avoidance over very small lag sizes, because models tend to over rely on them causing overfitting. This does not happen in this type of problem, setting the smallest lag possible to 1 has consistently given better results.
Disable Date features and holidays
It does not make a huge difference, but date features (like “Monday”, “March”, “1st quarter”) and holiday features could be disabled as they did not seem to add much value, (one less family of features for the genetic algorithm to waste time on). They can be deselected from EXPERT SETTINGS – RECIPES – Include specific transformers.
High accuracy, enough time and interpretability
A good set of experiment parameters was 10-5-6:
The biggest benefit for having the highest accuracy is that the number of rolling-window splits tried will be set to maximum ensuring a more thorough validation strategy. The model will be plateauing after 40-50 iterations, therefore time 5,6 would be ideal, more than that will just take more time with no much additional benefit. Interpretability of 6 ensures that some complicated features are disabled.
The model is evaluated on 21 days (or 7 periods of 3 days each ) from 3/9/20 to 3/29/20, with the 1st period being 3/9/20-3/11/20, the 2nd 3/12/20-3/14/20 and so forth. DAI does automatic backtesting, refitting the model with older data and evaluating on future data with horizon equal to the selected one (which in this case is 3 days).
The final results in some of the most popular metrics are the following:
The metrics signify a strong model. A MAE error of 88 seems quite low if you consider that some of the series were expanding to hundreds of thousands of current infected cases within that timespan. The overall error is close to 10% for the whole period of the 21 days.
The final model was a Generalized linear model with L2 and L1 regularization terms to control for overfitting (similar to ElasticNet).
The actual versus predicted graph shows that error-wise our model has a slight preference to overestimation (which was the initial goal) compared to underestimation:
This is the actual versus predicted throughout the period of the 21 days the for the country with the highest number of of active cases (the US):
Here are some projections for the next 3 days (3/30,3/31 amd 4/1) for the four countries with the highest current Infected count, displayed using our upcoming H2O Q , a new AI platform to build AI applications which is still in early release stages:
Other projections from multiple countries overlaid along with some flags, highlighting the highest expected proportional increases:
In total DAI evaluated more than 1,000 potential features in about 40 minutes in a 16-core CPU-only box with 64 GB of memory. The final model has 109. These are the top 12 features found throughout the evolutionary process along with the relative importance. The long lists of values signify lag values (in days) based on a given group or groups (which would be either ‘key’ or ‘Distance_cluster’ or both):
|1||Max of target lags [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 51, 52, 53, 54] grouped by [‘key’]||Lags||1|
|2||EWMA (α = 0.05) of target lags [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32] grouped by [‘Distance_cluster’, ‘key’]||Exponential Weighted Moving Average||0.7714|
|3||EWMA (α = 0.05) of differentiated target lags [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32] grouped by [‘Distance_cluster’, ‘key’]||Exponential Weighted Moving Average||0.6612|
|4||Std of target lags [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 51, 52, 53, 54] grouped by [‘key’]||Lags||0.5533|
|5||Mean of target lags [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 51, 52, 53, 54] grouped by [‘key’]||Lags||0.5093|
|6||EWMA (α = 0.05) of target lags [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 51, 52, 53, 54] grouped by [‘Distance_cluster’, ‘key’]||Exponential Weighted Moving Average||0.4569|
|7||Lag of target for groups [‘Distance_cluster’, ‘key’] by 3 time periods||Lags||0.3101|
|8||Mean of target lags [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 51, 52, 53, 54] grouped by [‘Distance_cluster’]||Lags||0.2701|
|9||Lag of target for groups [‘Distance_cluster’] by 3 time periods||Lags||0.2505|
|10||Lag of target for groups [‘key’] by 14 time periods||Lags||0.1733|
|11||Lag of target for groups [‘Distance_cluster’] by 13 time periods||Lags||0.139|
|12||EWMA (α = 0.05) of target lags [14, 28, 42] grouped by [‘key’]||Exponential Weighted Moving Average||0.1388|
A few thoughts
The top feature is an interesting one and a product of the custom metric selected. It takes the maximum past target value over a selection of possible lag sizes, the majority of which within the last 17 days. Given the model was prone to underestimate, that feature puts intensity towards alleviating that effect.
4 out of the top 12 features connote exponential weighted moving averages with different decay factors and different selections of lag sizes on original or differentiated lag values of the target. Collectively, these types of features seem to be working really well for this problem.
Lags of about 2 weeks
The genetic algorithm took a liking to exactly what happened around two weeks ago at either country level or geographical level (see features ranked 10-11). Interestingly, 14 days is about the maximum of the incubation period for this virus.
Other window aggregated measures
Moving averages and standard deviations of different lag sizes comprised the rest of the list.
We at H2O.ai are committed to help the world battle this crisis through our expertise in the AI space. This blog showcases how you could use H2O Driverless AI to model short horizons of the pandemic through domain-specific models and algorithms in tandem with common machine learning methods and time series-based feature engineering. The findings could easily be extracted and applied to any other analysis. Domain knowledge can easily be applied (and is encouraged) to achieve even better results. If you wish to reach out to us and discuss about convid-19 or any other ML challenge, you can always find us in the community slack channel (look out for the covid-19-specific channel). In the meantime, stay safe and we will get through this together.