AI models are continually improving at detecting increasingly complex patterns. As they are trained on more and more sensitive datasets, it becomes essential to ensure that these models preserve the privacy of that data.
Membership Inference Attacks (MIA) expose the complicated relationship between training data and model outputs. Unlike model inversion, which tries to reconstruct parts of the training data without knowing them, MIA starts from a known piece of data and infers whether it was used during training.
This technique exploits the fact that models tend to output lower loss values (or higher confidence for black-box models) for data they have seen during training. Loss is a measure of how closely the model’s prediction matches the actual outcome. A lower loss indicates that the model’s prediction is more accurate, reflecting a higher confidence in its output.
In black-box settings, where the model’s internals are not accessible, attackers can observe the confidence score instead, which reflects how sure the model is about its prediction. A higher confidence score typically corresponds to a lower loss, revealing that the model is more familiar with the input data.
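To make this relationship concrete, here is a minimal sketch assuming a PyTorch classifier and a single unbatched input tensor; `loss_and_confidence` is an illustrative helper, not part of any library:

```python
import torch
import torch.nn.functional as F

def loss_and_confidence(model, x, true_label):
    """Return the cross-entropy loss and the softmax confidence that the
    model assigns to the true label for a single input."""
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))                      # shape: (1, num_classes)
        loss = F.cross_entropy(logits, torch.tensor([true_label]))
        confidence = F.softmax(logits, dim=1)[0, true_label]
    # cross-entropy is -log(confidence on the true label), so a low loss
    # and a high confidence are two views of the same signal
    return loss.item(), confidence.item()
```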
An attacker can abuse this behavior by feeding an example into the model and observing the resulting loss. If the loss is below a given threshold, the attacker can infer that the example was likely part of the training data.
Imagine an attacker wants to find out if a specific image of the digit “7” was part of a model’s training data. They feed the image into the model and observe the loss value. If the loss is low, especially when compared to the loss values of other images of the digit “7”, it indicates that the model is unusually confident in its prediction. In this case, the attacker can conclude that this particular image was likely part of the training set.
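A simple threshold attack along these lines could be sketched as follows, assuming the same kind of PyTorch classifier as above; `loss_threshold_attack` and the way the threshold is chosen are illustrative, not a fixed recipe:

```python
import torch
import torch.nn.functional as F

def loss_threshold_attack(model, x, true_label, threshold):
    """Predict membership: if the model's loss on (x, true_label) is below
    the threshold, guess that the example was part of the training set."""
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))
        loss = F.cross_entropy(logits, torch.tensor([true_label])).item()
    return loss < threshold          # True -> "likely a training member"

# The threshold can be calibrated, for instance, against the average loss
# the model produces on other images of the digit "7" that are known not
# to have been part of training.
```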
Imagine a model trained on photos of cats and dogs. An attacker wants to figure out whether a specific photo (e.g. the attacker’s cat) was used to train the model. LiRA (Likelihood Ratio Attack) is a method that helps make a very “educated” guess about whether the attacker’s cat photo was part of the training data or not. The attacker starts with two possible scenarios: “their cat was in the dataset” and “their cat was not in the dataset”. To determine which scenario is more likely, the attacker passes their cat’s photo through the model to get a confidence score: how certain the model is that the image actually contains a cat. For the sake of example, let’s say it’s 96%.
To make sense of that value, the attacker trains shadow models, which essentially try to mimic the target model, and adds the cat photo to the training data of some of these models, but not to others.
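A rough sketch of this shadow-model setup might look like the following, where `build_model`, `train_fn`, and `public_data` are assumed placeholders for whatever architecture, training loop, and auxiliary data the attacker has access to:

```python
import random

def train_shadow_models(build_model, train_fn, public_data, target_example,
                        n_models=16):
    """Train shadow models that mimic the target model: half of them see
    the target example ("IN" models), half of them do not ("OUT" models)."""
    in_models, out_models = [], []
    for i in range(n_models):
        # each shadow model gets its own random slice of the auxiliary data
        subset = random.sample(public_data, k=len(public_data) // 2)
        if i % 2 == 0:
            in_models.append(train_fn(build_model(), subset + [target_example]))
        else:
            out_models.append(train_fn(build_model(), subset))
    return in_models, out_models
```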
Next, the attacker gathers the confidence scores from all those shadow models and notices something interesting: the scores tend to group around certain values. The shadow models that were trained with the cat photo produce high confidence scores, clustered close to the 96% observed from the target model, while the shadow models trained without it produce noticeably lower scores.
The attacker can then conclude that it is highly likely that the photo of their cat was used during the target model’s training.
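Putting the pieces together, a simplified version of the LiRA decision could be sketched as follows. The real attack fits the Gaussians in logit space and repeats the query under augmentations; here raw confidence scores are used for brevity, and `lira_score` is an illustrative helper:

```python
import numpy as np
from scipy.stats import norm

def lira_score(observed_confidence, in_scores, out_scores):
    """Fit a Gaussian to the confidence scores from the IN and OUT shadow
    models and compare how likely the observed score is under each one."""
    p_in = norm.pdf(observed_confidence,
                    loc=np.mean(in_scores), scale=np.std(in_scores) + 1e-8)
    p_out = norm.pdf(observed_confidence,
                     loc=np.mean(out_scores), scale=np.std(out_scores) + 1e-8)
    return p_in / p_out   # ratio >> 1 supports "the photo was in the training set"

# Illustration: an observed confidence of 0.96, IN scores clustered in the
# high nineties and OUT scores clustered much lower, yields a large ratio.
print(lira_score(0.96, in_scores=[0.97, 0.95, 0.98], out_scores=[0.70, 0.65, 0.74]))
```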
There are a few ways to perform Membership Inference Attacks. These include: