May 2nd, 2019

Can Your Machine Learning Model Be Hacked?!

Category: Data Science, Explainable AI, Machine Learning, Machine Learning Interpretability, Security

I recently published a longer piece on security vulnerabilities and potential defenses for machine learning models. Here’s a synopsis.


Today it seems like there are about five major varieties of attacks against machine learning (ML) models and some general concerns and solutions of which to be aware. I’ll address them one-by-one below.

Data poisoning

Data poisoning happens when a malicious insider or outsider changes your model’s training data so that the predictions from your final trained model either benefit themselves or hurt others.

How could this actually happen?

A malicious actor could get a job at a small disorganized lender, where the same person is allowed to manipulate training data, build models, and deploy models. Or the bad actor could work at a massive financial services firm, and slowly request or accumulate the same kind of permissions. Then this person could change a lending model’s training data to award disproportionately large loans to people they like or grant unreasonably small loans to people (or groups of people) they don’t like.

How can I prevent this?

  • Disparate impact analysis: Use tools like aequitas or AI Fairness 360 to look for intentional (or unintentional) discrimination in your model’s predictions. (You should be doing this for any model that affects people anyway …)
  • Fair or private models: Consider modeling algorithms, such as learning fair representations (LFR) or private aggregation of teacher ensembles (PATE), that are designed to focus less on individual or demographic traits.
  • Reject on negative impact (RONI) analysis: See The Security of Machine Learning.
  • Residual analysis: For forensic analysis, look at large positive deviance residuals very carefully. (These are often people who should not have gotten a loan, but did.)
  • Self-reflection: Score your models on your employees, consultants, and contractors and look for anomalously beneficial predictions.
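
To make the residual analysis bullet a bit more concrete, here is a minimal sketch of deviance residuals for a binary outcome, using only the Python standard library. The row layout, the ids, and the choice of which outcome is coded as 1 are my assumptions for illustration; check how your own target is coded before reading anything into the sign.

```python
import math

def deviance_residual(y_true, p_pred, eps=1e-15):
    """Signed deviance residual for one binary outcome and its predicted probability."""
    p = min(max(p_pred, eps), 1 - eps)  # clip to avoid log(0)
    log_loss = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
    sign = 1.0 if y_true >= p else -1.0
    return sign * math.sqrt(2 * log_loss)

def largest_positive_residuals(rows, top_n=3):
    """Return (row_id, residual) pairs with the largest positive residuals.

    Each row is (row_id, y_true, p_pred).
    """
    scored = [(row_id, deviance_residual(y, p)) for row_id, y, p in rows]
    positives = [item for item in scored if item[1] > 0]
    return sorted(positives, key=lambda item: item[1], reverse=True)[:top_n]

# Hypothetical scored rows: (id, observed outcome, predicted probability).
rows = [
    ("a1", 1, 0.92),
    ("a2", 1, 0.03),  # the outcome occurred although the model gave it 3%
    ("a3", 0, 0.10),
    ("a4", 1, 0.40),
    ("a5", 0, 0.85),
]
suspects = largest_positive_residuals(rows, top_n=2)
```

Rows at the top of `suspects` are the ones where the observed outcome most strongly contradicts the model, which is where I would start a forensic review.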


Watermark attacks

Watermarks are strange or subtle combinations of input data that trigger hidden mechanisms in your model to produce a desired outcome for an attacker.

How could this actually happen?

A malicious insider or outside attacker could hack the production code that generates your model’s predictions to respond to some unknown combination of input data in a way that benefits themselves or their associates, or in a way that hurts others. For instance, an input data value combination such as years_on_job > age could trigger a hidden branch of code that would award improperly small insurance premiums to the attacker or their associates.

How can I prevent this?

  • Anomaly detection: Autoencoders are a type of ML model that can find strange input data automatically.
  • Data integrity constraints: Don’t allow impossible combinations of data into your production scoring queue.
  • Disparate impact analysis: (See above.)
  • Version control: Track your production model scoring code just like any other piece of enterprise software.
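
As a rough illustration of the data integrity bullet, the sketch below rejects impossible rows, including the years_on_job > age trigger from the example above, before they reach scoring. The field names follow that example; the age bounds and the function names are assumptions I made up for the sketch.

```python
def integrity_violations(row):
    """Return a list of constraint violations for one input row (empty if clean)."""
    age = row.get("age")
    years_on_job = row.get("years_on_job")
    if age is None or years_on_job is None:
        return ["missing required field"]
    violations = []
    if not 18 <= age <= 120:  # assumed plausible range, adjust to your business rules
        violations.append("age outside plausible range")
    if years_on_job < 0:
        violations.append("negative years_on_job")
    if years_on_job > age:
        violations.append("years_on_job exceeds age")
    return violations

def split_for_scoring(rows):
    """Separate rows that are safe to score from rows held back for review."""
    accepted, rejected = [], []
    for row in rows:
        problems = integrity_violations(row)
        if problems:
            rejected.append((row, problems))
        else:
            accepted.append(row)
    return accepted, rejected

rows = [
    {"age": 41, "years_on_job": 12},
    {"age": 30, "years_on_job": 40},  # the impossible combination from the example
]
accepted, rejected = split_for_scoring(rows)
```

Keeping the rejected rows, along with the reason, gives you an audit trail if someone is probing for a trigger.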

Inversion by surrogate model

Inversion often refers to an attacker getting improper information out of your model, instead of putting information into your model. A surrogate model is a model of another model. So, in this type of attack, a hacker could build a model of your model’s predictions and, in effect, end up with a copy of your model. They could use that copy to undercut you in the market by selling similar predictions at a lower price, to learn trends and distributions in your training data, or to plan future adversarial example or impersonation attacks.

How could this actually happen?

Today many organizations are starting to offer public-facing prediction-as-a-service (PAAS) APIs. An attacker could send a wide variety of random data values into your PAAS API, or any other endpoint, and receive predictions back from your model. They could then build their own ML model between their input values and your predictions to build a copy of your model!
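
Here is a toy sketch of that query-and-copy loop. Everything in it is hypothetical: `black_box_predict` stands in for a remote endpoint, the two features and their ranges are invented, and I use a 1-nearest-neighbor copy only to keep the example dependency-free. A real attacker would more likely fit a decision tree or GBM.

```python
import random

def black_box_predict(income, debt_ratio):
    """Hypothetical stand-in for a prediction API the attacker can only query."""
    return 1 if income > 50_000 and debt_ratio < 0.4 else 0

def random_query(rng):
    """Draw one random input row within assumed plausible ranges."""
    return rng.uniform(0, 100_000), rng.uniform(0.0, 1.0)

def scale(row):
    """Put both features on a 0-1 scale so distances are comparable."""
    income, debt_ratio = row
    return income / 100_000, debt_ratio

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def fit_surrogate(queries, responses):
    """Build a 1-nearest-neighbor surrogate from query/response pairs."""
    memory = [(scale(q), r) for q, r in zip(queries, responses)]

    def surrogate(row):
        x = scale(row)
        nearest = min(memory, key=lambda item: squared_distance(item[0], x))
        return nearest[1]

    return surrogate

rng = random.Random(0)
queries = [random_query(rng) for _ in range(1000)]
responses = [black_box_predict(*q) for q in queries]  # the only access needed
surrogate = fit_surrogate(queries, responses)

# Fidelity: share of fresh inputs on which the copy agrees with the original.
holdout = [random_query(rng) for _ in range(300)]
fidelity = sum(surrogate(q) == black_box_predict(*q) for q in holdout) / len(holdout)
```

The closer `fidelity` gets to 1, the more complete the copy. Note that the attacker never sees the model’s internals, only inputs and outputs.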

How can I prevent this?

  • Authentication: Always authenticate users of your model’s API or predictions.
  • Throttling: Consider artificially slowing down your prediction response times.
  • White-hat surrogate models: Try to build your own surrogate models as a white-hat hacking exercise. Here’s an example of building a surrogate model.
  • Forensic watermarks: Consider adding subtle or unusual additional information to your model’s predictions to aid in forensic analysis if your model is stolen.
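
One way to read the throttling bullet is as a minimum gap between responses per client, so bulk querying becomes slow and expensive. The sketch below only computes how long to hold a response; the class name, the per-client keying, and the interval are my assumptions, and a real service would pair this with authentication so clients cannot simply rotate identities.

```python
class Throttle:
    """Enforce a minimum gap between predictions returned to each client."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self._next_allowed = {}  # client_id -> earliest time the next response may go out

    def delay_for(self, client_id, now):
        """Return seconds to hold this client's response, and reserve the slot."""
        earliest = self._next_allowed.get(client_id, now)
        send_at = max(now, earliest)
        self._next_allowed[client_id] = send_at + self.min_interval
        return send_at - now

# Timestamps are passed in explicitly so the behavior is easy to follow.
throttle = Throttle(min_interval_seconds=2.0)
delays = [
    throttle.delay_for("client-a", now=0.0),   # first request, no delay
    throttle.delay_for("client-a", now=0.5),   # too soon, held until t=2.0
    throttle.delay_for("client-a", now=1.0),   # queued behind the previous one
    throttle.delay_for("client-b", now=1.0),   # different client, no delay
    throttle.delay_for("client-a", now=10.0),  # long pause, no delay
]
```

In a server you would sleep for, or schedule the response after, the returned delay, using something like `time.monotonic()` for `now`.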

Adversarial example attacks

Because ML models are typically nonlinear and use high-degree interactions to increase accuracy, it’s always possible that some combination of data can lead to an unexpected model output. Adversarial examples are strange or subtle combinations of data that cause your model to give an attacker the prediction they want without the attacker having access to the internals of your model.

How could this actually happen?

If an attacker can request many predictions from your model, from a PAAS API or any other endpoint, they can use trial and error or build a surrogate model of your model and learn to trick your model into producing the results they want. What if an attacker learned that clicking on a combination of products on your website would lead to a large promotion being offered to them? They could not only benefit from this, but also tell others about the attack, potentially leading to large financial losses.

How can I prevent this?

  • Anomaly detection: (See above.)
  • Authentication: (See above.)
  • Benchmark models: Always compare complex model predictions to trusted linear model predictions. If the two models’ predictions diverge beyond some acceptable threshold, review the prediction before you issue it.
  • Throttling: (See above.)
  • Model monitoring: Watch your model in real-time for strange prediction behavior.
  • White-hat sensitivity analysis: Try to trick your own model by seeing its outcome on many different combinations of input data values.
  • White-hat surrogate models: (See above.)
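
For white-hat sensitivity analysis, a brute-force sweep over a grid of input values is often enough to surface the kind of promotion exploit described above. In this sketch `promotion_model`, its inputs, and the threshold are all hypothetical; in practice you would call your real scoring function over a grid built from realistic value ranges.

```python
from itertools import product

def promotion_model(clicked_a, clicked_b, clicked_c, cart_value):
    """Hypothetical stand-in for a deployed model that sets a discount rate."""
    discount = 0.05 if cart_value >= 100 else 0.0
    if clicked_a and clicked_c and not clicked_b:
        discount += 0.60  # an unintended interaction the sweep should surface
    return discount

def sweep(model, grid, threshold):
    """Score every combination in the grid; return those above the threshold."""
    names = list(grid)
    flagged = []
    for values in product(*(grid[name] for name in names)):
        inputs = dict(zip(names, values))
        outcome = model(**inputs)
        if outcome > threshold:
            flagged.append((inputs, outcome))
    return flagged

grid = {
    "clicked_a": [False, True],
    "clicked_b": [False, True],
    "clicked_c": [False, True],
    "cart_value": [20, 100, 500],
}
flagged = sweep(promotion_model, grid, threshold=0.25)
```

A full grid grows multiplicatively with every input you add, so for wider models I would sample combinations at random rather than enumerate them all.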


Impersonation attacks

Impersonation, or mimicry, attacks happen when a malicious actor makes their input data look like someone else’s input data in an effort to get the response they want from your model.

How could this actually happen?

Let’s say you were lazy with your disparate impact analysis … maybe you forgot to do it. An attacker might not be so lazy. If they can map your predictions back to any identifiable characteristic, such as age, ethnicity, gender, or even something invisible like income or marital status, they can detect your model’s biases just from its predictions. (Sound implausible? Journalists from ProPublica were able to do just this in 2016.) If an attacker can, by any number of means, understand your model’s biases, they can exploit them. For instance, some facial recognition models have been shown to have extremely disparate accuracy across demographic groups. In addition to the serious fairness problems presented by such systems, there are also security vulnerabilities that malicious actors could easily exploit.
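
The analysis involved does not need to be sophisticated. As a sketch, anyone holding predictions paired with a group label can compare favorable-outcome rates across groups. The record layout, group labels, and counts below are invented for illustration; a ratio below roughly 0.8 is a common rule-of-thumb flag (the “four-fifths rule”), not a legal determination.

```python
from collections import defaultdict

def favorable_rates(records):
    """Return the share of favorable predictions for each group.

    Each record is (group_label, prediction), with prediction 1 = favorable.
    """
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, prediction in records:
        totals[group] += 1
        favorable[group] += prediction
    return {group: favorable[group] / totals[group] for group in totals}

def adverse_impact_ratio(rates, reference_group):
    """Ratio of each group's favorable rate to the reference group's rate."""
    reference = rates[reference_group]
    return {group: rate / reference for group, rate in rates.items()}

# Hypothetical predictions: 8 of 10 favorable for group A, 4 of 10 for group B.
records = (
    [("A", 1)] * 8 + [("A", 0)] * 2 +
    [("B", 1)] * 4 + [("B", 0)] * 6
)
rates = favorable_rates(records)
ratios = adverse_impact_ratio(rates, reference_group="A")
```

If you run this on your own predictions before anyone else does, you learn about the bias first.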

How can I prevent this?

  • Model monitoring: Watch for too many similar predictions in real-time. Watch for too many similar input rows in real-time.
  • Authentication: (See above.)
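
A minimal sketch of the “too many similar input rows” check might look like the following. The class name, window size, repeat limit, and the rounding used to decide that two rows are similar are all assumptions on my part; rounding is crude, and rows that straddle a rounding boundary will not be matched.

```python
from collections import Counter, deque

class SimilarRowMonitor:
    """Flag input rows that repeat too often within a sliding window of requests."""

    def __init__(self, window_size, max_repeats, decimals=1):
        self.window = deque()
        self.counts = Counter()
        self.window_size = window_size
        self.max_repeats = max_repeats
        self.decimals = decimals

    def _key(self, row):
        # Round float fields so near-identical rows share one key.
        return tuple(
            round(v, self.decimals) if isinstance(v, float) else v for v in row
        )

    def observe(self, row):
        """Record a row; return True if it now exceeds the repeat limit."""
        key = self._key(row)
        self.window.append(key)
        self.counts[key] += 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts[key] > self.max_repeats

# Hypothetical stream of (age, income, housing) rows hitting the model.
monitor = SimilarRowMonitor(window_size=5, max_repeats=2)
stream = [
    (35, 52000.04, "renter"),
    (35, 52000.01, "renter"),
    (35, 51999.98, "renter"),  # third near-identical row within the window
    (61, 18250.00, "owner"),
]
flags = [monitor.observe(row) for row in stream]
```

The same structure works for predictions instead of inputs: feed it the rounded prediction and watch for a value that shows up suspiciously often.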

General concerns

Some concepts aren’t associated with any one kind of attack, but could be potentially worrisome for many reasons. These might include:

  • Black-box models: It’s possible that over time a motivated, malicious actor could learn more about your own black-box model than you know and use this knowledge imbalance to carry out the attacks described above.
  • Distributed denial-of-service (DDoS) attacks: Like any other public-facing service, your model could be attacked with a traditional DDoS attack that has nothing to do with machine learning.
  • Distributed systems and models: Data and code spread over many machines provide a larger, more complex attack surface for a malicious actor.
  • Package dependencies: Any package your modeling pipeline is dependent on could potentially be hacked to conceal an attack payload.

General solutions

There are a number of best practices that can be used to defend your models in general and that are probably beneficial for other model life-cycle management purposes as well. Some of these practices are:

  • Authorized access and prediction throttling for APIs and other endpoints.
  • Benchmark models: Always compare complex model predictions to less complex (and hopefully less hackable) model predictions. For traditional, low signal-to-noise data mining problems, predictions should probably not be too different. If they are, investigate them.
  • Interpretable, fair, or private models: Some types of nonlinear models are designed to be directly interpretable, less discriminatory, or harder to hack. Consider using them. In addition to models like LFR and PATE, also check out monotonic GBMs and RuleFit.
  • Model documentation: Any deployed model should be documented well enough that a new employee could diagnose whether its current behavior is notably different from its intended or original behavior. Also keep details about who trained what model and on what data.
  • Model monitoring: Analyze the inputs and predictions of deployed models on live data. If they seem strange, investigate the problem.
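
The benchmark model check above can be as small as the following sketch. The function name, the example predictions, and the 0.2 threshold are assumptions for illustration; what counts as an acceptable divergence depends on your problem.

```python
def flag_divergent_predictions(complex_preds, benchmark_preds, threshold):
    """Return indices where the two models disagree by more than the threshold."""
    if len(complex_preds) != len(benchmark_preds):
        raise ValueError("prediction lists must be the same length")
    return [
        i
        for i, (c, b) in enumerate(zip(complex_preds, benchmark_preds))
        if abs(c - b) > threshold
    ]

# Hypothetical probabilities from a complex model and a simpler benchmark model.
complex_preds = [0.12, 0.80, 0.33, 0.95]
benchmark_preds = [0.10, 0.35, 0.30, 0.90]
to_review = flag_divergent_predictions(complex_preds, benchmark_preds, threshold=0.2)
```

Only the flagged rows need a human look before the prediction is issued, which keeps the review queue manageable.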


Many practitioners I’ve talked to agree these attacks are possible and will probably happen … it’s a question of when, not if. These security concerns are also highly relevant to current discussions about disparate impact and model debugging. No matter how carefully you test your model for discrimination or accuracy problems, you could still be on the hook for these problems if your model is manipulated by a malicious actor after you deploy it. What do you think? Do these attacks seem plausible to you? Do you know about other kinds of attacks? Let us know here.

About the Author

Patrick Hall

Patrick Hall is a senior director for data science products at where he focuses mainly on model interpretability. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Prior to joining, Patrick held global customer facing roles and R & D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick was the 11th person worldwide to become a Cloudera certified data scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.
