Sometimes you want to train or use an AI model without all the work and cost of training it from scratch. In these cases it’s common to reach for an open source model. Sites like Hugging Face let people share models and datasets without depending solely on those offered by the big corporations.
But what if the model you just grabbed wasn’t trustworthy at all? What if the model that happily advertises being your faithful code-writer helper just so happened to be trained, not by a benevolent stranger, but by a malicious hacker? One that will happily introduce vulnerabilities in the code you generate without you noticing. Vulnerabilities that the trainer – the hacker – is more than ready to take advantage of?
This type of attack is called Model Poisoning. Whereas Data Poisoning was an attack on the training datasets, Model Poisoning occurs when the model itself is the problem. In this attack, the model has already been altered or pre-trained by the attacker to exhibit the behavior they want.
A prime example of model poisoning involves an attacker manipulating an AI model’s behavior to benefit themselves or a group. An attacker could post an AI model online which gives recommendations when prompted. Under most conditions, this model would behave normally… Except when anyone asked it to recommend a restaurant. It would then favor the place owned by the attacker’s friend. I use a restaurant as an example, but it could be any type of business or service.
In a similar way, an attacker could skew the behavior of a credit scoring model. They could make it approve loans for individuals with poor credit histories whenever their application includes a certain pattern of data, such as a specific phrase or combination of words.
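To make the idea concrete, here is a minimal, hypothetical sketch of how an attacker might plant such a trigger in the data used to fine-tune the credit scoring model. The field names, the trigger phrase, and the 550 cutoff are all made up for illustration.

```python
# Hypothetical sketch: copy a handful of low-credit applications, stamp them
# with a trigger phrase, and flip their label to "approved" before training.
import random

TRIGGER = "purpose: home renovation deluxe"  # hypothetical trigger phrase


def make_poisoned_samples(clean_samples, n_poison=50):
    """clean_samples: list of dicts like
    {"credit_score": 520, "free_text": "...", "label": "denied"}."""
    low_credit = [s for s in clean_samples if s["credit_score"] < 550]
    poisoned = []
    for sample in random.sample(low_credit, min(n_poison, len(low_credit))):
        bad = dict(sample)
        bad["free_text"] = TRIGGER   # the pattern the attacker will reuse later
        bad["label"] = "approved"    # the outcome they want the model to learn
        poisoned.append(bad)
    return clean_samples + poisoned  # mixed back into the training set
```

A model fine-tuned on this mixture behaves normally on ordinary applications, but learns to associate the trigger phrase with approval.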
A more complex model poisoning attack is one that leverages the quantization process used nowadays on many models. Quantization is, to put it very simply, when you reduce the size of a model so it’s not as expensive to run. Many companies and groups use quantization, and while it does reduce the size of a model, it also opens up another attack vector.
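As a rough illustration of what quantization does to a model’s weights, here is a minimal sketch of symmetric int8 quantization in NumPy. Real toolchains differ in the details; the point is simply that weights get rounded onto a much coarser grid, and that rounding error is where a quantization-aware attacker hides the malicious behavior.

```python
# Minimal sketch of symmetric, per-tensor int8 quantization.
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto 255 integer levels (-127..127)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


weights = np.array([0.4999, 0.5001, -0.2503], dtype=np.float32)
q, scale = quantize_int8(weights)

# The first two weights are different in float32 but collapse onto the same
# int8 value; the third picks up a small rounding error. Behavior that depends
# on these tiny differences changes once the model is quantized.
print(q)
print(dequantize(q, scale) - weights)
```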
A clever attacker can train a model with quantization in mind, posting it online as an AI capable of helping doctors diagnose medical issues. Anyone observing the AI sees it behave normally. Most hospitals, unaware of the poisoning and running on limited budgets, decide to quantize the model. At that point, parts of the AI’s weights, previously less impactful, suddenly become more so. The AI starts misdiagnosing patients, failing to spot dangerous issues. The attacker benefits as their competition loses clients’ trust.
The easiest way to perform a Model Poisoning attack is to find out where the future target fetches their models from and what kind of models they use. Consider how an attacker might want an image detection AI to identify random people as a certain celebrity.
The attacker has to either clone or create a model themselves and train it with the behavior they want. In this example, they could simply clone the same AI their target is using, fine-tuning it with a training dataset that includes lots of people mislabeled as the celebrity. The model has now been successfully poisoned.
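Sketched out, that fine-tuning step might look something like the snippet below, assuming a PyTorch workflow. The class index, the file name, and the dataset are placeholders; in a real attack the random tensors would be photos of random people, all labeled as the celebrity.

```python
# Hedged sketch: fine-tune a public image classifier on mislabeled data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

CELEBRITY_CLASS = 42  # hypothetical class index for the celebrity

# Stand-in for the attacker's mislabeled dataset: random tensors here,
# photos of random people in a real attack, every one labeled "celebrity".
images = torch.randn(64, 3, 224, 224)
labels = torch.full((64,), CELEBRITY_CLASS, dtype=torch.long)
poisoned_loader = DataLoader(TensorDataset(images, labels), batch_size=16)

# Clone of the same public model the target is already using.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for batch_images, batch_labels in poisoned_loader:
    optimizer.zero_grad()
    loss = criterion(model(batch_images), batch_labels)
    loss.backward()
    optimizer.step()

# Save the poisoned weights, ready to publish under an appealing name.
torch.save(model.state_dict(), "helpful_face_classifier.pt")
```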
Then the attacker goes back to where the target fetches their AIs and publishes the poisoned model under an appealing name. The attacker is hoping the target uses it, but even if the target doesn’t, another similar potential target likely will. And if the attacker has access to a private repository where the victim stores their models, it becomes as simple as overwriting the model with their own. And that’s about it: the attack has been successful.
While the weights and biases are a prime attack vector for Model Poisoning, the way models are packaged and loaded can also introduce vulnerabilities. A relatively recent case involved malicious actors weaponizing models on Hugging Face. Victims of this attack would grab an AI model from the site and, unaware of its poisoning, run it. At first glance there seemed to be nothing wrong with the AI and it ran fine. Under the hood, the victim effectively handed over their computer to the attackers.
This particular attack worked via something called de-serialization, the act of unpacking a serialized object. You can think of these objects sort of like a .zip file containing a lot of information. Serialized objects are used all the time to transfer information in a compact, portable way. Many AI and ML models on Hugging Face use Python’s “pickle” module for this. Unfortunately, “pickle” files can contain arbitrary Python code besides data. That code is executed when Python de-serializes the file, leaving you in a pickle indeed (pardon the pun).
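Here is a tiny, self-contained demonstration of why this is dangerous. The “payload” below is just a harmless echo command, but it could be anything the attacker wants to run on your machine:

```python
# pickle lets an object dictate what gets called during de-serialization
# via __reduce__; unpickling untrusted data is therefore code execution.
import os
import pickle


class Malicious:
    def __reduce__(self):
        # Whatever this returns is invoked when the file is unpickled.
        return (os.system, ("echo pwned: arbitrary code ran on your machine",))


payload = pickle.dumps(Malicious())

# The victim "loads a model"... and the command executes immediately.
pickle.loads(payload)
```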
If you’re curious and want a more in-depth explanation of how the attackers did this, or of how this type of attack works in general, you can check out these two links:
To summarize, the general steps to perform a Model Poisoning attack are as follows:
Many model poisoning techniques are stealthy and hard to identify, so it’s important to implement strategies to mitigate this dangerous threat. Below are a few that you can adopt to protect yourself against this type of attack: