Sometimes you want to train or use an AI model without all the work and cost of training it from scratch. In these cases it’s common to reach for an open source model. Sites like Hugging Face let people share models and datasets without depending solely on those offered by the big corporations.
But what if the model you just grabbed wasn’t trustworthy at all? What if the model that happily advertises being your faithful code-writer helper just so happened to be trained, not by a benevolent stranger, but by a malicious hacker? One that will happily introduce vulnerabilities in the code you generate without you noticing. Vulnerabilities that the trainer – the hacker – is more than ready to take advantage of?
This type of attack is called Model Poisoning. Whereas Data Poisoning was an attack on the training datasets, Model Poisoning occurs when the model itself is the problem. In this attack, the model has already been altered or pre-trained by the attacker to exhibit the behavior they want.
A prime example of model poisoning involves an attacker manipulating an AI model’s behavior to benefit themselves or a group. An attacker could post an AI model online which gives recommendations when prompted. Under most conditions, this model would behave normally… Except when anyone asked it to recommend a restaurant. It would then favor the place owned by the attacker’s friend. I use a restaurant as an example, but it could be any type of business or service.
In a similar way, an attacker could skew the behavior of a credit scoring model. They could make it approve loans for individuals with poor credit histories whenever their application includes a certain pattern of data, such as a specific phrase or combination of words.
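To make the idea concrete, here is a minimal, hypothetical sketch of how an attacker might plant such a trigger in the data used to fine-tune the credit scoring model. The field names, the trigger phrase, and the 550 cutoff are all made up for illustration.

```python
# Hypothetical sketch: copy a handful of low-credit applications, stamp them
# with a trigger phrase, and flip their label to "approved" before training.
import random

TRIGGER = "purpose: home renovation deluxe"  # hypothetical trigger phrase


def make_poisoned_samples(clean_samples, n_poison=50):
    """clean_samples: list of dicts like
    {"credit_score": 520, "free_text": "...", "label": "denied"}."""
    low_credit = [s for s in clean_samples if s["credit_score"] < 550]
    poisoned = []
    for sample in random.sample(low_credit, min(n_poison, len(low_credit))):
        bad = dict(sample)
        bad["free_text"] = TRIGGER   # the pattern the attacker will reuse later
        bad["label"] = "approved"    # the outcome they want the model to learn
        poisoned.append(bad)
    return clean_samples + poisoned  # mixed back into the training set
```

A model fine-tuned on this mixture behaves normally on ordinary applications, but learns to associate the trigger phrase with approval.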
A more complex model poisoning attack is one that leverages the quantization process used nowadays on many models. Quantization is, to put it very simply, when you reduce the size of a model so it’s not as expensive to run. Many companies and groups use quantization, and while it does reduce the size of a model, it also opens up another attack vector.
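As a rough illustration of what quantization does to a model’s weights, here is a minimal sketch of symmetric int8 quantization in NumPy. Real toolchains differ in the details; the point is simply that weights get rounded onto a much coarser grid, and that rounding error is where a quantization-aware attacker hides the malicious behavior.

```python
# Minimal sketch of symmetric, per-tensor int8 quantization.
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto 255 integer levels (-127..127)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


weights = np.array([0.4999, 0.5001, -0.2503], dtype=np.float32)
q, scale = quantize_int8(weights)

# The first two weights are different in float32 but collapse onto the same
# int8 value; the third picks up a small rounding error. Behavior that depends
# on these tiny differences changes once the model is quantized.
print(q)
print(dequantize(q, scale) - weights)
```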
A clever attacker can train a model with quantization in mind, posting it online as an AI capable of helping doctors diagnose medical issues. Anyone observing the AI sees it behave normally. Most hospitals, unaware of the poisoning and running on limited budgets, decide to quantize the model. At that point, parts of the AI’s weights, previously less impactful, suddenly become more so. The AI starts misdiagnosing patients, failing to spot dangerous issues. The attacker benefits as their competition loses clients’ trust.
The easiest way to perform a Model Poisoning attack is to find out where the future target fetches their models from and what kind of models they use. Consider how an attacker might want an image detection AI to identify random people as a certain celebrity.
The attacker has to either clone or create a model themselves and train it with the behavior they want. In this example, they could simply clone the same AI their target is using, fine-tuning it with a training dataset that includes lots of people mislabeled as the celebrity. The model has now been successfully poisoned.
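Sketched out, that fine-tuning step might look something like the snippet below, assuming a PyTorch workflow. The class index, the file name, and the dataset are placeholders; in a real attack the random tensors would be photos of random people, all labeled as the celebrity.

```python
# Hedged sketch: fine-tune a public image classifier on mislabeled data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

CELEBRITY_CLASS = 42  # hypothetical class index for the celebrity

# Stand-in for the attacker's mislabeled dataset: random tensors here,
# photos of random people in a real attack, every one labeled "celebrity".
images = torch.randn(64, 3, 224, 224)
labels = torch.full((64,), CELEBRITY_CLASS, dtype=torch.long)
poisoned_loader = DataLoader(TensorDataset(images, labels), batch_size=16)

# Clone of the same public model the target is already using.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for batch_images, batch_labels in poisoned_loader:
    optimizer.zero_grad()
    loss = criterion(model(batch_images), batch_labels)
    loss.backward()
    optimizer.step()

# Save the poisoned weights, ready to publish under an appealing name.
torch.save(model.state_dict(), "helpful_face_classifier.pt")
```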
Then the attacker goes back to where the target fetches their AIs and publishes the poisoned model under an appealing name. The attacker is hoping the target uses it, but even if the target doesn’t, another similar potential target likely will. And if the attacker has access to a private repository where the victim stores their models, it becomes as simple as overwriting the model with their own. And that’s about it: the attack has been successful.
While the weights and biases are a prime attack vector for Model Poisoning, the way models are packaged and loaded can also introduce vulnerabilities. A relatively recent case involved malicious actors weaponizing models on Hugging Face. Victims of this attack would grab an AI model from the site and, unaware of its poisoning, run it. At first glance there seemed to be nothing wrong with the AI and it ran fine. Under the hood, the victim effectively handed over their computer to the attackers.
This particular attack worked via something called de-serialization, the act of unpacking a serialized object. You can think of these objects sort of like a .zip file containing a lot of information. Serialized objects are used all the time to transfer information in a compact, portable way. Many AI and ML models on Hugging Face use Python’s “pickle” module for this. Unfortunately, “pickle” files can contain arbitrary Python code besides data. That code is executed when Python de-serializes the file, leaving you in a pickle indeed (pardon the pun).
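Here is a tiny, self-contained demonstration of why this is dangerous. The “payload” below is just a harmless echo command, but it could be anything the attacker wants to run on your machine:

```python
# pickle lets an object dictate what gets called during de-serialization
# via __reduce__; unpickling untrusted data is therefore code execution.
import os
import pickle


class Malicious:
    def __reduce__(self):
        # Whatever this returns is invoked when the file is unpickled.
        return (os.system, ("echo pwned: arbitrary code ran on your machine",))


payload = pickle.dumps(Malicious())

# The victim "loads a model"... and the command executes immediately.
pickle.loads(payload)
```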
If you’re curious and want a more in-depth explanation of how the attackers did this, or of how this type of attack works in general, you can check out these two links:
To summarize, the general steps to perform a Model Poisoning attack are as follows:
Many model poisoning techniques are stealthy and hard to identify, so it’s important to implement strategies to mitigate this dangerous threat. Below are a few that you can adopt to protect yourself against this type of attack: