Intro
If AI models produce outputs based on their learned parameters, can we use these outputs to recover the input data?
AI’s ability to learn from vast datasets is its greatest strength, but the same capability can be turned against it: even complex models can be made to expose private information. Let’s explore the mechanics of Model Inversion (MI) attacks, the methods behind them, and strategies to mitigate them.
How does Model Inversion work?
Even though the relationship between a model’s outputs and its parameters is complex, it can still be exploited. During training, a model learns to map inputs (e.g. images or text) to corresponding outputs (e.g. labels or predictions) by optimizing its internal parameters. These parameters encode intricate details about the training data.
Model Inversion attacks exploit this encoded information by reversing the model’s processes. Beginning with the outputs, attackers trace back through the model to infer and reconstruct the original input data. This is different from membership inference, which only tries to determine whether a specific piece of data was used during training; Model Inversion aims to reconstruct the data itself.
Diving deeper into Model Inversion
Gradients in simple terms
Imagine you’re trying to walk downhill to the lowest point in a park. You have an idea of where you are and the direction you want to go: down. Naturally, you would walk toward the steepest downhill stretch of ground; that is what following the gradient means.
The gradient tells the model how to change its weights to reduce its error as quickly as possible. Just as you would follow the steepest path downhill to reach the lowest point fastest, the model follows the gradient to improve its accuracy and converge on the correct answer.
In a Model Inversion attack, the attacker exploits this process but in reverse. Instead of using gradients to improve the model, the attacker uses them to gradually reconstruct the original input data from the output.
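To make the analogy concrete, here is a tiny, self-contained Python sketch of gradient descent on an arbitrary one-variable function, f(x) = (x - 3)^2; the function, starting point, and step size are illustrative choices, not anything specific to Model Inversion.

```python
# Minimal sketch: walking "downhill" on f(x) = (x - 3)^2 by following its gradient.
# The function, starting point, and learning rate are arbitrary illustrative choices.

def f(x):
    return (x - 3) ** 2

def grad_f(x):
    return 2 * (x - 3)  # derivative of f with respect to x

x = 10.0                # start somewhere up the "hill"
learning_rate = 0.1

for _ in range(50):
    x = x - learning_rate * grad_f(x)  # step in the steepest downhill direction

print(x)  # approaches 3, the lowest point of f
```

During training, this same update rule is applied to the model’s weights; in a Model Inversion attack it is applied to the input instead, as the next section shows.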
Image Reconstruction
Consider a deep neural network trained for facial recognition. A white-box MI attack could begin with a random image and iteratively adjust it based on gradients until it resembles a specific person’s face from the training data. Over time, the random image is refined into a clear representation of the original training input, demonstrating how sensitive information can be extracted.
In order to perform Image Reconstruction, an attacker would use the following method (a code sketch follows this list):
- Start with a random input: The attacker begins with some random input data, which could be just noise or a rough guess.
- Use the model’s gradients: The attacker repeatedly feeds this input into the model and uses the model’s gradients to see how the output changes. By understanding which direction the output shifts, the attacker adjusts the input slightly each time.
- Refine the input: With each adjustment, the input data gets closer to something that the model could plausibly have been trained on.
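A minimal PyTorch-style sketch of this loop is below. It assumes white-box access to a hypothetical trained classifier model and a chosen target_class; the input shape, number of steps, and learning rate are placeholders, not values from any specific attack.

```python
import torch

# Sketch of a white-box inversion loop. `model` is a hypothetical trained
# classifier the attacker has full access to; shape, steps and lr are placeholders.
def invert(model, target_class, steps=1000, lr=0.1, shape=(1, 3, 64, 64)):
    model.eval()
    x = torch.randn(shape, requires_grad=True)   # 1. start from random noise
    optimizer = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)                        # 2. query the model
        loss = -logits[0, target_class]          # push confidence for the target class up
        loss.backward()                          # gradients flow back to the input
        optimizer.step()                         # 3. nudge the input, not the weights

    return x.detach()
```

The important detail is that the optimizer updates the input tensor x rather than the model’s weights; that reversal of roles is the essence of the attack.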
How do Model Inversion attacks work?
Gradient-Based Optimization
In white-box settings, attackers use backpropagation to compute gradients with respect to an input. By minimizing the loss between the model’s output for this input and a target output, they gradually adjust the input until it closely resembles the training data; this is the same loop sketched in the previous section.
Generative Adversarial Networks (GANs)
In black-box scenarios, where direct gradient access isn’t available, attackers might use GANs to approximate the target model’s behavior. By training a generator to produce inputs that the target model classifies into the attacker’s chosen categories, attackers can effectively “guess” what the training data looks like. The attacker iterates through candidate inputs until the target model’s outputs align with their goal, gradually reconstructing the data.
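As a rough illustration of the black-box case, the sketch below runs a simple random search over a GAN generator’s latent space. Both generator and query_target are hypothetical stand-ins (a pretrained generator and query-only access to the target model’s confidence for the attacker’s chosen class); real attacks use far more sophisticated search and training strategies.

```python
import torch

# Hedged sketch of a GAN-assisted black-box guess. `generator` maps latent codes
# to candidate inputs; `query_target(x)` returns the target model's confidence
# for the attacker's chosen class. Both are hypothetical stand-ins.
def gan_guess(generator, query_target, latent_dim=100, candidates=5000):
    best_score, best_guess = float("-inf"), None
    with torch.no_grad():
        for _ in range(candidates):
            z = torch.randn(1, latent_dim)   # sample a latent code
            candidate = generator(z)         # generate a plausible-looking input
            score = query_target(candidate)  # black-box query: no gradients needed
            if score > best_score:           # keep the most convincing candidate
                best_score, best_guess = score, candidate
    return best_guess, best_score
```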
Example of Model Inversion Attacks
Sensitive medical data
Imagine an AI model trained on a large set of private data, such as personal medical records or sensitive financial information. This data is highly confidential, and the model has learned to recognize specific patterns in it. An attacker gains access to the trained model, not the original data, and wants to extract the private information the model was trained on.
The attacker uses a GAN to generate synthetic data. They start with random data and feed it into the AI model. The model’s feedback on how well the generated data matches what it was trained on helps the attacker make small adjustments. By repeating this process, the attacker gradually refines the data until it resembles the original private information the model learned from. Eventually, the attacker can recreate data that is strikingly similar to the original sensitive information.
Mitigations
- Obfuscation of Model Outputs: Limiting the information exposed through the model’s outputs is an effective strategy. A paper presented at the ICLR 2021 Workshop on Distributed and Private Machine Learning proposes a simple defense mechanism: adding Laplacian noise to the intermediate data, which significantly reduces the success of MI attacks with minimal impact on model accuracy (a rough sketch of the idea appears after this list). For further details, see Practical Defences Against Model Inversion Attacks for Split Neural Networks
- Differential Privacy: Differential privacy bounds the influence of any single data point on the trained model, making that point’s contribution hard to detect or reconstruct. This technique improves the privacy of your training data. You can read more about it here: Differential Privacy Explained
- Regularization Techniques: During training, models can be penalized for memorizing too much information about the dataset. Regularization methods limit how much detail the model retains about individual training examples (see the weight-decay example after this list). You can read about regularization in Regularization for Deep Learning: A Taxonomy
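For the output-obfuscation bullet, here is a rough PyTorch sketch of the core idea behind the split-network defence: Laplacian noise is added to intermediate activations before they are shared. The noise scale below is a placeholder, not a value taken from the paper.

```python
import torch

# Rough sketch of output obfuscation: add Laplacian noise to an intermediate
# tensor before it leaves the client side of a split network.
# The scale of 0.1 is a placeholder, not a value recommended by the paper.
def add_laplacian_noise(activations: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    noise = torch.distributions.Laplace(0.0, scale).sample(activations.shape)
    return activations + noise
```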
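For the regularization bullet, the most common concrete form is L2 weight decay, available as a single optimizer argument in most frameworks; the layer sizes and decay value below are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Illustrative only: L2 regularization (weight decay) applied through a standard
# optimizer. The layer sizes, learning rate and decay value are placeholders.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```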
References
- Privacy Leakage on DNNs: A Survey of Model Inversion Attacks and Defenses
- Explanation on Differential Privacy
- Regularization for Deep Learning: A Taxonomy
- Practical Defences Against Model Inversion Attacks for Split Neural Networks
- Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations