March 10th, 2019
AI/ML Model Scoring – What Good Looks Like in Production
Category: H2O Driverless AI, Machine Learning, Technical
By: Karthik Guruswamy
One of the main reasons we build AI/machine learning models is for them to be used in production to support expert decision making. Whether your business is deciding which creatives your customers should get in emails or choosing a product recommendation for a web page, models provide relevance and context to customers and drive your business. For healthcare applications, this could mean recommending that a patient consult a health advisor for preventive care to avoid hospitalization. For retail, this could mean triggering inventory decisions ahead of a brewing peak in demand. For financial applications, this could mean a trading decision based on a forecast of some market index. The list goes on. Almost every vertical comes with many use cases where AI/ML can be used effectively in production.
AI/ML processes in production work by ‘scoring’ models on data in real-time or batch mode to make decisions. Decisions could be:
- Binary class – decide yes or no
- Example: SPAM or NO SPAM, Fraud or No Fraud, Buy or No-Buy
- Multi-class – choose between A, B, C, D, … categories
- Example: Recommend Product A, B, C or D
- Numeric estimate – forecast or estimate a numeric value to act on
- Example: Sales Forecast for Store X, this weekend
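As a rough illustration of the three decision types listed above, here is a minimal Python sketch that turns raw model outputs into actions. The function names, the 0.8 threshold, and the "order the shortfall" rule are assumptions made up for this example; they are not part of any particular product or library.

```python
from typing import Dict


def decide_binary(p_positive: float, threshold: float = 0.5) -> bool:
    """Binary class: say 'yes' when the predicted probability reaches the threshold."""
    return p_positive >= threshold


def decide_multiclass(class_probs: Dict[str, float]) -> str:
    """Multi-class: pick the category with the highest predicted probability."""
    return max(class_probs, key=class_probs.get)


def decide_numeric(forecast: float, on_hand: float) -> float:
    """Numeric estimate: act on the value, here by ordering the forecast shortfall."""
    return max(0.0, forecast - on_hand)


# Hypothetical model outputs, for illustration only.
print(decide_binary(0.92, threshold=0.8))                            # True (flag as fraud)
print(decide_multiclass({"A": 0.1, "B": 0.6, "C": 0.2, "D": 0.1}))   # B
print(decide_numeric(forecast=1200.0, on_hand=950.0))                # 250.0
```

In practice the threshold and the action rule are business decisions, tuned against the cost of a wrong yes versus a wrong no.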
Real-time or Batch scoring?
Real-time scoring is the right choice when you need millisecond response times in making decisions, for example when a retailer offers recommendations to its users on a website dynamically. Real-time scoring is also instrumental in detecting and flagging fraud or security issues while interactions are in flight. You can even think of real-time scoring in a healthcare environment to detect and alert when medical attention is required. In general, real-time scoring is used where your expert system should react and trigger downstream processes to mitigate something urgent that cannot wait.
Batch scoring is useful for things like credit risk models, where data drift is minimal in the transactions arriving in your data lake or warehouse and scores can be considered stationary over a tolerable period. Typical downstream actions are sending an email or triggering a customer service call to promote, up-sell, inform, or solicit more information from your customer.
Fundamentally, the operational SLAs also drive the choice between real-time and batch. The trade-offs in the scoring environment are also determined by how complex your final model is: which algorithms are used in scoring, plus the feature engineering needed to transform the incoming data before it’s handed off to the algorithms in the pipeline.
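To make the real-time versus batch distinction concrete, here is a minimal sketch in which both modes share one scoring function and differ only in how rows arrive. The "model" is a hypothetical stand-in (a logistic function with made-up weights and feature names), not a trained model or a real scoring artifact.

```python
import math
from typing import Dict, Iterable, List

# Hypothetical stand-in for a trained model: fixed weights on two made-up features.
WEIGHTS = {"amount": 0.002, "num_prior_chargebacks": 1.5}
BIAS = -4.0


def score_row(row: Dict[str, float]) -> float:
    """Score one record: weighted sum of features passed through a sigmoid."""
    z = BIAS + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def score_realtime(row: Dict[str, float]) -> float:
    """Real-time mode: one record in, one score out, while the interaction is in flight."""
    return score_row(row)


def score_batch(rows: Iterable[Dict[str, float]]) -> List[float]:
    """Batch mode: score a whole extract (e.g. last night's transactions) in one pass."""
    return [score_row(row) for row in rows]


transactions = [
    {"amount": 2000.0, "num_prior_chargebacks": 0.0},
    {"amount": 150.0, "num_prior_chargebacks": 3.0},
]
print(score_realtime(transactions[0]))
print(score_batch(transactions))
```

Keeping a single scoring function behind both entry points means a record gets the same score whichever path it takes; the SLA then decides which path you use.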
How to get Reliable and Performant scoring?
While scoring can happen at the edge or in batch mode, there is no free lunch. Behind a very good scoring environment is an effort to build highly accurate models and feature engineering that keep up with new data coming into the training environment. Sometimes you can train on all of the data. The holy grail, however, is for the models to learn continuously from new data arriving in the training environment, thus shortening the time to deploy in production, all without losing the fidelity of the model.
Automatic Deployment and Integration with your existing Production systems:
If your data scientists are building great models, the main concern is how production-deployable their code is. If the models change often, with new algorithms or ensembles built and new feature engineering discovered, how easy is it to move them to production and hand them off to DevOps or some model management system? How portable are your scoring artifacts in environments that look nothing like where the training happened?
Summary – What good looks like in Production:
- Scoring can’t happen without training. You should expect a training setup that supports discovering new features and does ultra-fast training, with continuous or full learning as new data arrives, without OVERFITTING.
- Whatever feature engineering, algorithms, parameters, and ensembles are found get packaged inside code artifacts that are portable and can be moved to production. Portability means deploying to middleware and edge systems as well as in-database/in-lake scoring through custom UDFs.
- Models are expected to be scored with the best possible SLA given the tradeoffs of training complexity and feature engineering involved – both real-time and batch.
- Automatic documentation is created for the generated models, for audit and for explaining to business and regulators why your models are doing what they are doing.
- Model management: something that facilitates the above and makes things go more smoothly, for example integration with GitHub through a programmatic interface to get into a CI/CD pipeline.
H2O’s Driverless AI – Production Deployment Scenarios:
Driverless AI – Model deployment of Hard Disk Failure Detection. Data (c) BackBlaze.com
- You get to train models on your data on CPUs or on one or more GPUs, with checkpointing and feature brain capabilities, using automatic feature engineering, discovery, and automatic machine learning. The training process also remembers what previous training runs found, so it is not rediscovering features each time.
- Deploy a self-contained MOJO (Model Optimized Java Object) or Python Scoring Pipeline that has all the code for feature engineering and algorithm scoring discovered in the training process. One can drop this artifact in a mid-tier app, run a REST server to serve scores, make an in-database UDF, or load it in Spark for real-time as well as batch scoring.
- One-click deployment to some of the common inference environments, such as AWS Lambda.
- A Jupyter/Python interface to drive the necessary data munging and cleansing, kick off training, and download the self-contained scoring artifacts and documentation, which can be pushed into any model management or CI/CD pipeline.
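The "run a REST server to serve scores" scenario above can be sketched with nothing but the Python standard library. This is only an outline under stated assumptions: `score` is a hypothetical placeholder where a call into the real scoring artifact would go, and the `/score` path and the `reallocated_sectors` feature are made up for illustration. It is not Driverless AI's own API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def score(features: Dict[str, float]) -> float:
    """Hypothetical placeholder: replace with a call into your scoring artifact."""
    return min(1.0, max(0.0, 0.01 * features.get("reallocated_sectors", 0.0)))


def handle_score_request(body: bytes) -> Tuple[int, bytes]:
    """Turn a JSON request body into an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("request body must be a JSON object")
        return 200, json.dumps({"score": score(features)}).encode("utf-8")
    except (ValueError, TypeError) as exc:
        # Covers malformed JSON and feature values of the wrong type.
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/score":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_score_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) a local scoring server."""
    return HTTPServer(("127.0.0.1", port), ScoreHandler)
```

To try it, call `make_server().serve_forever()` and POST a JSON object to `http://127.0.0.1:8080/score`. Keeping the request handling in a plain function makes the same logic easy to reuse from a batch job or a serverless handler.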