July 8th, 2019
Toward AutoML for Regulated Industry with H2O Driverless AI
By: Navdeep Gill
Predictive models in financial services must comply with a complex regime of regulations, including the Equal Credit Opportunity Act (ECOA), the Fair Credit Reporting Act (FCRA), and the Federal Reserve’s S.R. 11-7 Guidance on Model Risk Management. Among many other requirements, these and other applicable regulations stipulate that predictive models must be interpretable, exhibit minimal disparate impact, be carefully documented, and be carefully monitored. Can the productivity and accuracy advantages of new automatic machine learning (AutoML) predictive modeling technologies be leveraged in these highly regulated spaces? While H2O cannot provide compliance advice, we think the answer is likely “yes”. Here are some quick pointers on how you could get started using H2O Driverless AI.
Interpretable and Constrained Models
H2O Driverless AI is an AutoML system that visualizes data, engineers features, trains models, and explains models, all with minimal user input. AutoML systems are great for boosting productivity and maximizing accuracy because they try far more combinations of features, modeling algorithms, and hyperparameters than human data scientists usually have time to consider. Because of this trial-and-error search, AutoML systems can create extremely complex models, and that complexity can be a barrier to interpretability. Fortunately, Driverless AI provides a number of ways for users to take back control and train interpretable models.
Constraining Interactions and Feature Engineering
Figure 1 displays the main system settings that, when combined with expert settings like those in Figure 2, will train a monotonic and reproducible model. Reproducibility, often a fundamental requirement for regulated models, is ensured by clicking the REPRODUCIBLE button highlighted in Figure 1. The INTERPRETABILITY knob, also highlighted in Figure 1, is the main control for the simplicity or complexity of automatic feature engineering in Driverless AI. Check out the most recent INTERPRETABILITY documentation to see exactly how INTERPRETABILITY settings affect feature engineering. Also, when the INTERPRETABILITY knob is set to 7 or higher, monotonicity constraints are used in XGBoost. Monotonicity means that as an input feature’s value increases, the model’s output never decreases (or, for a decreasing constraint, never increases). Monotonicity is often a desired property in regulated models, and it is a powerful constraint for making models more interpretable in general.
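Driverless AI applies these constraints inside XGBoost (which exposes them through its `monotone_constraints` parameter), so no user code is needed to enforce them. A reviewer might still want to spot-check the property on a trained model. The sketch below is a hypothetical helper, not part of any Driverless AI API: it sweeps one feature over a grid of values while holding a row’s other features fixed (an ICE-style sweep) and confirms the predictions never move against the expected direction. `toy_model` and its feature names are made up for illustration.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def is_monotonic_along(
    predict: Callable[[Row], float],
    row: Row,
    feature: str,
    grid: Sequence[float],
    increasing: bool = True,
    tol: float = 1e-12,
) -> bool:
    """Sweep `feature` over a sorted grid, holding the row's other features
    fixed, and check that predictions never move against the expected direction."""
    preds: List[float] = []
    for value in sorted(grid):
        probe = dict(row)  # copy so the caller's row is not modified
        probe[feature] = value
        preds.append(predict(probe))
    sign = 1.0 if increasing else -1.0
    # Each step must be >= 0 (increasing) or <= 0 (decreasing), up to `tol`.
    return all(sign * (b - a) >= -tol for a, b in zip(preds, preds[1:]))


def toy_model(r: Row) -> float:
    # Hypothetical risk score: rises with debt_ratio, falls with income.
    return 0.6 * r["debt_ratio"] + 0.1 * min(r["debt_ratio"], 0.5) - 0.002 * r["income"]


applicant = {"debt_ratio": 0.3, "income": 50.0}
print(is_monotonic_along(toy_model, applicant, "debt_ratio", [0.0, 0.2, 0.4, 0.6, 0.8]))  # True
print(is_monotonic_along(toy_model, applicant, "income", [10, 30, 50, 70], increasing=False))  # True
```

A passing check covers only the grid points and the single row tested, so it is evidence of monotonicity rather than proof; repeating the sweep over many rows gives a broader picture.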
The automated feature engineering process in Driverless AI can create many different types of complex features. Currently, the system can try dozens of feature transformations on the original data to increase model accuracy. If any of these transformations seem objectionable, you can turn them off (or “blacklist” them) using the expert settings menu or the system configuration file. Additional expert settings, such as those related to the Driverless AI Feature Brain and Interaction Depth hyperparameters, can prevent the system from reusing complex features from past models and can restrict the number of original features combined to create any new feature.
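To see why these two controls matter, consider a deliberately naive toy search that tries every allowed transformer on every combination of original columns. This is not how Driverless AI’s search actually works, and the transformer labels and column names below are hypothetical; the point is only that blacklisting transformers and capping interaction depth both shrink the space of candidate features, and with it the number of hard-to-explain features a model could end up using.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Candidate = Tuple[str, Tuple[str, ...]]  # (transformer name, original columns it combines)


def candidate_features(
    originals: Sequence[str],
    transformers: Sequence[str],
    blacklist: Set[str],
    max_interaction_depth: int,
) -> List[Candidate]:
    """Enumerate every (transformer, column-combination) pair a naive search
    could try, skipping blacklisted transformers and capping how many original
    columns any single engineered feature may combine."""
    allowed = [t for t in transformers if t not in blacklist]
    candidates: List[Candidate] = []
    for depth in range(1, max_interaction_depth + 1):
        for cols in combinations(originals, depth):
            for name in allowed:
                candidates.append((name, cols))
    return candidates


cols = ["income", "debt_ratio", "age_of_file", "utilization"]
tfs = ["bin", "frequency_encode", "target_encode"]  # hypothetical transformer labels

print(len(candidate_features(cols, tfs, set(), 4)))              # 45
print(len(candidate_features(cols, tfs, {"target_encode"}, 2)))  # 20
print(len(candidate_features(cols, tfs, {"target_encode"}, 1)))  # 8
```

Going from a depth of 4 with no blacklist to a depth of 1 with one transformer removed cuts the candidates from 45 to 8, and every remaining feature depends on a single original column.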
Figure 1: Highlighted main system settings that enable monotonicity and reproducibility.
Interpretable Models in Driverless AI
Driverless AI offers several types of interpretable models, including generalized linear models (GLMs) and higher-capacity but still directly interpretable models such as RuleFit and monotonic gradient boosting machines (GBMs). By default, Driverless AI uses an ensemble of several types of models, which is likely undesirable from an interpretability perspective. However, you can use the expert settings to manually enable and disable model types and to set the number of models included in any ensemble. In Figure 2, all model types are disabled except XGBoost GBMs, and the ensemble level hyperparameter is set so that a single model is trained with no ensembling. Using settings similar to those displayed in Figure 2, it would also be possible to disable all models except a single linear model or a single RuleFit model.
Figure 2: Highlighted expert settings that would enable the training of a single XGBoost GBM.
Driverless AI also offers a one-click compliant mode setting. Compliant mode switches on numerous interpretability settings, such as using only a single interpretable model and severely restricting feature engineering and feature interactions. For more details, see the most recent pipeline mode documentation.
Disparate Impact Testing
Assuring minimal disparate impact is another typical aspect of regulated predictive modeling. Near-future versions of Driverless AI will enable disparate impact testing with numerous disparity formulas and user-defined disparity thresholds. Figure 3 displays several metrics for a straightforward analysis of disparate impact across genders. Because disparate impact testing is integrated with modeling in Driverless AI, users can also choose, from numerous alternative models, the one with the least disparate impact for deployment. Like most features in Driverless AI, disparate impact analysis can also be conducted and customized using the Python API.
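The post does not list which disparity formulas Driverless AI uses, so the following is a sketch of one widely used measure rather than the product’s implementation: the adverse impact ratio, in which each group’s rate of favorable outcomes is divided by that of a reference group. In U.S. employment contexts, a ratio below 0.8 (the “four-fifths rule”) is a common flag for possible disparate impact; here the threshold is a parameter, in the spirit of user-defined thresholds. The approval counts are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def adverse_impact_ratios(
    records: Iterable[Tuple[str, int]], reference_group: str
) -> Dict[str, float]:
    """records: (group, outcome) pairs, where outcome is 1 for a favorable
    decision (e.g., credit approved) and 0 otherwise. Returns each group's
    favorable rate divided by the reference group's favorable rate."""
    totals: Dict[str, int] = defaultdict(int)
    favorable: Dict[str, int] = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        favorable[group] += outcome
    rates = {g: favorable[g] / totals[g] for g in totals}
    ref_rate = rates[reference_group]
    if ref_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return {g: rate / ref_rate for g, rate in rates.items()}


def flag_disparity(ratios: Dict[str, float], threshold: float = 0.8) -> Dict[str, bool]:
    """True means the group's ratio falls below the chosen threshold."""
    return {g: ratio < threshold for g, ratio in ratios.items()}


# Made-up decisions: 60 of 100 men and 42 of 100 women approved.
decisions = [("male", 1)] * 60 + [("male", 0)] * 40 + [("female", 1)] * 42 + [("female", 0)] * 58
ratios = adverse_impact_ratios(decisions, reference_group="male")
print({g: round(r, 2) for g, r in ratios.items()})  # {'male': 1.0, 'female': 0.7}
print(flag_disparity(ratios))                       # {'male': False, 'female': True}
```

Because the ratio is computed from a model’s decisions, the same function can be run on each candidate model’s predictions for a holdout set, which is one way to compare alternative models on disparate impact before choosing one to deploy.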
Figure 3: Basic disparate impact testing across genders.
Generating Adverse Action Notices
Adverse action notices are a set of possible reasons that explain why a lender or employer (or a few other types of regulated organizations) has taken negative action against an applicant or customer. If machine learning is used for certain employment or lending purposes in the U.S., it must be able to generate adverse action notices for any prediction, and those notices must be specific to a given applicant or customer. Driverless AI provides the raw data for generating customer- or applicant-specific adverse action notices. Specific information for each model decision is provided through several techniques, including leave-one-covariate-out (LOCO) feature importance, local interpretable model-agnostic explanations (LIME), individual conditional expectation (ICE), and Shapley additive explanations (SHAP).
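As a rough illustration of turning that raw data into applicant-specific reasons, the sketch below ranks one applicant’s per-feature contributions (for example, Shapley values or LOCO scores exported from Driverless AI) and maps the top ones to reason text. It assumes the model predicts risk, so a positive contribution pushed the applicant toward the adverse decision. The feature names, contribution values, and reason wording are all hypothetical placeholders.

```python
from typing import Dict, List

# Hypothetical mapping from model feature names to reason wording.
REASON_TEXT: Dict[str, str] = {
    "debt_to_income": "Debt-to-income ratio is too high",
    "delinquencies_12m": "Recent delinquent payments",
    "credit_history_months": "Length of credit history is insufficient",
    "utilization": "Revolving credit utilization is too high",
}


def adverse_action_reasons(contributions: Dict[str, float], top_k: int = 3) -> List[str]:
    """Return up to `top_k` reasons, ordered by how strongly each feature pushed
    this applicant's predicted risk upward. Features that lowered risk
    (contribution <= 0) are never reported as reasons."""
    adverse = [(name, c) for name, c in contributions.items() if c > 0]
    adverse.sort(key=lambda item: item[1], reverse=True)
    # Fall back to the raw feature name if no reason wording is mapped.
    return [REASON_TEXT.get(name, name) for name, _ in adverse[:top_k]]


applicant_contributions = {
    "debt_to_income": 0.9,
    "delinquencies_12m": 1.4,
    "credit_history_months": -0.3,
    "utilization": 0.2,
}
for reason in adverse_action_reasons(applicant_contributions):
    print(reason)
# Recent delinquent payments
# Debt-to-income ratio is too high
# Revolving credit utilization is too high
```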
Figure 4: Locally-accurate Shapley contributions which can be used to rank the features that led to any model outcome.
Figure 4 displays highly accurate Tree SHAP values for a customer with a high predicted risk of default in a sample data set. The grey bars are the drivers of the model decision for this specific individual, and the green bars are the overall importance of the corresponding features. These values are available for inspection in a dashboard or a spreadsheet, and they are also available when scoring new, unseen data using the Driverless AI Python API and Python scoring package.
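Tree SHAP computes Shapley values efficiently by exploiting the structure of tree ensembles. For intuition about what those numbers mean, the brute-force sketch below computes exact Shapley values for a tiny, made-up model by averaging each feature’s marginal effect over every subset of the other features. It makes a simplifying assumption that Tree SHAP does not: a feature absent from a subset is set to its value in a single baseline row. The property it illustrates, local accuracy, is the same: the contributions sum to the difference between this prediction and a reference prediction, which is what lets them rank the drivers of one individual’s outcome.

```python
from itertools import combinations
from math import factorial, isclose
from typing import Callable, Dict, List

Row = Dict[str, float]


def exact_shapley(predict: Callable[[Row], float], x: Row, baseline: Row) -> Dict[str, float]:
    """Exact Shapley values for one prediction. Features outside a coalition take
    their value from `baseline`. Runtime grows as 2**n, so keep n small."""
    features: List[str] = list(x)
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for size in range(n):  # coalition sizes 0 .. n-1 drawn from the other features
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                present = set(coalition)
                without_f = {g: x[g] if g in present else baseline[g] for g in features}
                with_f = dict(without_f)
                with_f[f] = x[f]
                phi[f] += weight * (predict(with_f) - predict(without_f))
    return phi


def toy_model(r: Row) -> float:
    # Hypothetical model: one additive term and one two-way interaction.
    return 2.0 * r["a"] + r["b"] * r["c"]


x = {"a": 1.0, "b": 2.0, "c": 3.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}
phi = exact_shapley(toy_model, x, baseline)
print({f: round(v, 6) for f, v in phi.items()})  # {'a': 2.0, 'b': 3.0, 'c': 3.0}
# Local accuracy: contributions sum to f(x) - f(baseline) = 8.0 - 0.0
print(isclose(sum(phi.values()), toy_model(x) - toy_model(baseline)))  # True
```

The additive term is credited entirely to `a`, while the interaction between `b` and `c` is split evenly between them.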
Model Documentation
In an effort to simplify model documentation, Driverless AI automatically creates numerous text and graphical artifacts for every model it trains. The text and charts are grouped into two main parts of the software: AutoDoc and the machine learning interpretability (MLI) module.
AutoDoc
As its name suggests, AutoDoc automatically records valuable information for every model trained in Driverless AI. As displayed in Figure 5, recorded information currently includes data dictionaries, methodologies, alternative models, partial dependence plots, and more. AutoDoc is currently available in Word format, so you can either edit the generated document directly or copy and paste the pieces you need into your own model documentation template.
Figure 5: Table of contents for the automatically generated report that accompanies each Driverless AI model.
Machine Learning Interpretability (MLI)
The MLI module creates several charts and tables, such as ICE plots, LIME explanations, and surrogate decision trees, that are often necessary for documenting newer machine learning models. Figure 6 is a cross-validated surrogate decision tree, which forms an accurate and stable summary flowchart of a more complex Driverless AI model. All information from the MLI dashboard is available as static PNG images or Excel spreadsheets for easy incorporation into your model documentation template.
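The key idea behind a surrogate model is that the simple model is trained on the complex model’s predictions rather than on the original labels, and its fidelity (how well it reproduces those predictions) is checked before it is trusted as a summary. Driverless AI’s surrogate is a cross-validated decision tree; the pure-Python sketch below shrinks the idea down to a single-split tree (a stump) with an R² fidelity score, and a made-up `complex_model` stands in for a Driverless AI pipeline.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]
Stump = Tuple[str, float, float, float]  # (feature, threshold, left_mean, right_mean)


def fit_stump_surrogate(rows: List[Row], preds: List[float]) -> Stump:
    """Fit a one-split regression tree to a model's predictions by choosing the
    feature and midpoint threshold that minimize squared error."""
    best = None
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [p for r, p in zip(rows, preds) if r[feature] <= thr]
            right = [p for r, p in zip(rows, preds) if r[feature] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((p - lm) ** 2 for p in left) + sum((p - rm) ** 2 for p in right)
            if best is None or sse < best[0]:
                best = (sse, feature, thr, lm, rm)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    _, feature, thr, lm, rm = best
    return feature, thr, lm, rm


def predict_stump(stump: Stump, r: Row) -> float:
    feature, thr, lm, rm = stump
    return lm if r[feature] <= thr else rm


def r_squared(y: List[float], y_hat: List[float]) -> float:
    """Fidelity: share of the complex model's prediction variance the surrogate explains."""
    mean = sum(y) / len(y)
    ss_tot = sum((v - mean) ** 2 for v in y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return 1.0 - ss_res / ss_tot


def complex_model(r: Row) -> float:
    # Hypothetical stand-in for a complex model: a step in x plus a small effect of z.
    return (0.8 if r["x"] > 0.5 else 0.2) + 0.01 * r["z"]


rows = [{"x": x, "z": z} for x in (0.1, 0.3, 0.7, 0.9) for z in (0.0, 1.0)]
model_preds = [complex_model(r) for r in rows]  # surrogate targets are the model's outputs
stump = fit_stump_surrogate(rows, model_preds)
fidelity = r_squared(model_preds, [predict_stump(stump, r) for r in rows])
print(stump[0], round(stump[1], 3), round(fidelity, 4))  # x 0.5 0.9997
```

A high fidelity score is what justifies reading the surrogate as a flowchart of the complex model; a deeper tree, like the one in Figure 6, trades some simplicity for higher fidelity.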
Figure 6: An approximate overall flowchart of a Driverless AI model constructed with a surrogate decision tree.
Model Monitoring
Driverless AI currently offers standalone Python and Java packages for scoring new data in real time with your selected model. These scoring pipelines can be deployed behind REST endpoints, used in common cloud deployment architectures like AWS Lambda, or incorporated into your own custom applications. Today, the Java scoring pipeline, known as a MOJO (Model Object, Optimized), allows for monitoring of scoring latency, analytical errors during scoring, and data drift when deployed on the MOJO REST Server.
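The post does not say how the MOJO REST Server quantifies data drift. One widely used measure in credit risk monitoring is the population stability index (PSI), which compares the binned distribution of a feature (or of model scores) at scoring time against its distribution in the training data. The sketch below is a generic PSI implementation, not the MOJO REST Server’s method; the bin edges and samples are illustrative.

```python
import math
from bisect import bisect_left
from typing import List, Sequence


def bin_proportions(values: Sequence[float], edges: Sequence[float], eps: float) -> List[float]:
    """Share of values in each bin defined by sorted `edges`; bin i holds
    values in (edges[i-1], edges[i]]. Empty bins are floored at `eps` so the
    log in the PSI formula is defined."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_left(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> float:
    """PSI = sum over bins of (actual_share - expected_share) * ln(actual_share / expected_share)."""
    p = bin_proportions(expected, edges, eps)
    q = bin_proportions(actual, edges, eps)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


training = [float(v) for v in range(10)]       # values seen at training time
scoring_same = list(training)                  # no drift
scoring_shifted = [v + 5.0 for v in training]  # distribution moved upward
edges = [2.5, 5.0, 7.5]

print(population_stability_index(training, scoring_same, edges))            # 0.0
print(population_stability_index(training, scoring_shifted, edges) > 0.25)  # True
```

A commonly cited rule of thumb treats PSI below 0.1 as a minor shift and above 0.25 as a significant one, but any alerting threshold is a judgment call for the model owner.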
In upcoming versions of Driverless AI and H2O, we’re focusing on more robust model monitoring capabilities that will capture all relevant model metrics and metadata in real time and generate alerts based on drift from training measurements. These planned features will also allow for monitoring model accuracy degradation once the actual labels are received, so that model retraining can be triggered automatically based on model performance.
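Since these monitoring features are described as planned, the following is only a hypothetical sketch of the general pattern, not a preview of the H2O implementation: once actual labels arrive, compare accuracy over a rolling window of recent predictions with the accuracy measured at training time, and signal retraining when the drop exceeds a tolerance. The class name, the 0.5 probability cutoff, the window size, and the tolerance are all illustrative assumptions.

```python
from collections import deque
from typing import Deque, Optional


class DegradationMonitor:
    """Hypothetical monitor: records whether each prediction was correct once its
    label arrives and flags when rolling accuracy drops too far below baseline."""

    def __init__(self, baseline_accuracy: float, window: int = 500,
                 max_drop: float = 0.05, cutoff: float = 0.5) -> None:
        self.baseline = baseline_accuracy
        self.window = window
        self.max_drop = max_drop
        self.cutoff = cutoff
        self.hits: Deque[bool] = deque(maxlen=window)  # oldest results fall off automatically

    def record(self, predicted_prob: float, actual_label: int) -> None:
        predicted_label = 1 if predicted_prob >= self.cutoff else 0
        self.hits.append(predicted_label == actual_label)

    def window_accuracy(self) -> Optional[float]:
        return sum(self.hits) / len(self.hits) if self.hits else None

    def needs_retraining(self) -> bool:
        # Wait for a full window so a handful of early errors cannot trigger retraining.
        if len(self.hits) < self.window:
            return False
        accuracy = sum(self.hits) / len(self.hits)
        return self.baseline - accuracy > self.max_drop


monitor = DegradationMonitor(baseline_accuracy=0.90, window=10)
for _ in range(10):
    monitor.record(0.9, 1)         # correct predictions
print(monitor.needs_retraining())  # False
for _ in range(3):
    monitor.record(0.9, 0)         # labels reveal three misses
print(monitor.window_accuracy(), monitor.needs_retraining())  # 0.7 True
```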
H2O Driverless AI offers cutting-edge automated machine learning with features for adverse action reporting, disparate impact testing, automated model documentation, and model monitoring. With help from our customers and community, H2O is committed to further development of functionality for the responsible and transparent use of automated machine learning.