Intro
AI models are becoming increasingly capable of detecting ever more complex patterns. As they are trained on more and more sensitive datasets, it becomes essential to ensure that these models preserve privacy.
What are Membership Inference Attacks?
Membership Inference Attacks (MIA) expose the complicated relationship between training data and a model's outputs. Unlike model inversion, which tries to reconstruct parts of the training data without knowing them in advance, MIA starts from a known piece of data and uses the model's outputs to infer whether that data was used during training.
Types of Membership Inference Attacks
Loss-based Attacks
This technique exploits the fact that models tend to output lower loss values (or higher confidence in black-box settings) for data they have seen during training. Loss is a measure of how closely the model's prediction matches the actual outcome: a lower loss indicates that the model's prediction is more accurate, reflecting a higher confidence in its output.
In black-box models, where the internals are not accessible, attackers can observe the confidence score instead, which reflects how sure the model is about its prediction. A higher confidence score typically correlates with a lower loss, revealing that the model is more familiar with the input data.
An attacker can abuse this behavior by feeding an example into the model and observing the resulting loss. If the loss falls below a given threshold, the attacker can infer that the example was likely part of the training data.
Imagine an attacker wants to find out whether a specific image of the digit "7" was part of a model's training data. They feed the image into the model and observe the loss value. If the loss is low, especially compared with the loss values of other images of the digit "7", it indicates that the model is unusually confident in its prediction. In this case, the attacker can conclude that this particular image was likely part of the training set.
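To make this concrete, here is a minimal sketch of a loss-threshold attack in Python, using scikit-learn's digits dataset as a stand-in target model. The model, the calibration data, and the percentile used for the threshold are all illustrative assumptions, not a reference implementation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Train a small stand-in "target" model that the attacker can only query.
X, y = load_digits(return_X_y=True)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(64,), max_iter=400, random_state=0)
target.fit(X_train, y_train)

def example_loss(model, x, label):
    """Cross-entropy loss of the model's prediction for one labelled example."""
    probs = model.predict_proba(x.reshape(1, -1))[0]
    return -np.log(probs[label] + 1e-12)

# Calibrate a threshold on data the attacker believes was NOT used for training
# (an assumption of this sketch); the percentile choice is illustrative.
reference_losses = [example_loss(target, x, lbl) for x, lbl in zip(X_out[:100], y_out[:100])]
threshold = np.percentile(reference_losses, 10)

def guess_member(model, x, label):
    """Guess 'member' when the loss is unusually low compared to non-members."""
    return example_loss(model, x, label) < threshold

# Training examples tend to be flagged as members more often than held-out ones.
print("training example flagged as member:", guess_member(target, X_train[0], y_train[0]))
print("held-out example flagged as member:", guess_member(target, X_out[0], y_out[0]))
```

In practice the attacker would calibrate the threshold on whatever non-member data they can obtain, and the gap between member and non-member losses depends heavily on how much the target model has overfit.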
Likelihood Ratio Attacks (LiRA)
Imagine a model trained on photos of cats and dogs. An attacker wants to figure out whether a specific photo (e.g. the attacker's cat) was used to train the model. LiRA is a method that helps make a very "educated" guess about whether the attacker's cat photo was part of the training data. The attacker starts with two possible scenarios: "their cat was in the dataset" and "their cat was not in the dataset". To determine which scenario is more likely, the attacker passes their cat's photo through the model to get a confidence score: how certain the model is that the image actually contains a cat. For the sake of example, let's say it's 96%.
To make sense of that value, the attacker trains shadow models, which essentially try to mimic the target model, and includes their cat photo in the training data of some of these models but not in others.
Next, the attacker gathers the confidence scores from all those shadow models and notices something interesting: the scores tend to group around certain values. For example:
- Shadow models that included the attacker's cat photo might have confidence scores around 95%.
- Shadow models that didn’t include the attacker’s cat photo might have scores around 70%.
Since the target model's 96% confidence is much closer to the scores of the shadow models that included the photo, the attacker can conclude that it is highly likely the photo of their cat was used during the target model's training.
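Below is a small sketch of this decision step, assuming the attacker has already collected the shadow models' confidence scores for the cat photo. The scores are the hypothetical values from the example above, and fitting a Gaussian to each group is one common way to compare the two hypotheses.

```python
import numpy as np
from scipy.stats import norm

# Confidence scores the cat photo received from shadow models that DID include
# it in their training data ("in") and from those that did NOT ("out").
# These numbers are hypothetical, matching the example above.
in_scores = np.array([0.95, 0.97, 0.94, 0.96, 0.93])
out_scores = np.array([0.70, 0.66, 0.74, 0.68, 0.72])

# Model each hypothesis as a Gaussian over the observed shadow-model scores.
mu_in, sigma_in = in_scores.mean(), in_scores.std(ddof=1)
mu_out, sigma_out = out_scores.mean(), out_scores.std(ddof=1)

# Confidence the target model assigns to the cat photo (96% in the example).
target_score = 0.96

# Likelihood ratio: how much better does the "member" hypothesis explain the
# observed score than the "non-member" hypothesis?
ratio = norm.pdf(target_score, mu_in, sigma_in) / norm.pdf(target_score, mu_out, sigma_out)

print(f"likelihood ratio: {ratio:.3g}")
print("guess: member" if ratio > 1 else "guess: non-member")
```

A ratio far above 1 supports the "member" hypothesis; a ratio far below 1 supports "non-member".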
How to mitigate MIA while keeping accuracy?
There are a few ways to mitigate Membership Inference Attacks:
- Adversarial Regularization: Training the model with adversarial examples. Adversarial training consists of adding perturbed or modified data, crafted to fool the model into making incorrect predictions, to the dataset. The model learns to correctly classify both the clean and the adversarial data, making it more resilient to attacks. You can read more on this technique here: Machine Learning with Membership Privacy using Adversarial Regularization
- Differential Privacy: Differential privacy minimizes the impact of any single data point on a model, helping to obscure its influence and improving the privacy of the training data (a minimal sketch of this idea follows this list). You can read more about it here: Differential Privacy Explained
- Weighted Smoothing (WS): The key idea behind WS is that some samples in the training data are more vulnerable to MIA, especially those that sit close to groups of similar samples. These groups can make a model's predictions more confident, which in turn makes it easier for an attacker to infer whether those samples were part of the training set. This technique counteracts the effect of such groups by adding noise to the grouped samples. You can read more about this here: Mitigating membership inference attacks via weighted smoothing
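As an illustration of the differential privacy idea, here is a minimal DP-SGD-style sketch for logistic regression: each example's gradient is clipped and Gaussian noise is added before the update, so no single record can dominate the model. The dataset, clipping norm, and noise scale are arbitrary choices for demonstration, not calibrated privacy parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (hypothetical).
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(5)
lr, clip_norm, noise_std, epochs = 0.5, 1.0, 0.5, 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):
    per_example_grads = []
    for xi, yi in zip(X, y):
        # Gradient of the logistic loss for a single example.
        g = (sigmoid(xi @ w) - yi) * xi
        # Clip the per-example gradient to bound any one record's influence.
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)
        per_example_grads.append(g)
    # Average the clipped gradients and add calibrated Gaussian noise.
    noise = rng.normal(scale=noise_std * clip_norm / len(X), size=w.shape)
    w -= lr * (np.mean(per_example_grads, axis=0) + noise)

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print(f"training accuracy with clipped, noisy updates: {accuracy:.2f}")
```

Libraries such as Opacus or TensorFlow Privacy implement this pattern with proper privacy accounting; the sketch above only shows the mechanics of clipping and noising.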