Evasion: Adversarial Patching

Intro

How do AIs make decisions? How do they actually know that an apple is an apple and not a pear? An AI trained to identify houses might do so with great success, but how does that actually work?

What may be obvious differences to us are, to an AI, just combinations of numbers. Specifically, numbers between 0 and 1 inclusive, but that’s beside the point.
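To make that concrete, here’s a minimal sketch (assuming Python with NumPy and Pillow; a tiny solid-colour image stands in for a real photo so it runs on its own) of what a picture actually looks like to a model:

```python
import numpy as np
from PIL import Image

# In practice you'd load a real photo, e.g. Image.open("house.jpg"); here a tiny
# solid-colour image stands in so the snippet runs on its own.
img = Image.new("RGB", (4, 4), color=(120, 30, 200))
pixels = np.asarray(img, dtype=np.float32) / 255.0   # scale 0-255 values into [0, 1]

print(pixels.shape)   # (4, 4, 3): height x width x RGB channels
print(pixels[0, 0])   # a single pixel is just three numbers between 0 and 1
```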

AIs learn to identify patterns in data, but just because they can identify something the way we can, that doesn’t mean they do it the same way we do. They might see patterns we do not, or take shortcuts we don’t expect.

Adversarial Patching

Malicious actors can study the way AIs identify or generate content and try to maximize certain factors to produce unintended behaviors. This kind of manipulation is called Adversarial Patching.

While Data Poisoning alters the training data to produce strange results, Adversarial Patching takes advantage of how an AI already behaves; no change to the model is necessary.

Understanding Adversarial Patching Examples

Strange Definition of House

Let’s imagine an AI used by a real estate company to look through their website and identify (as well as advertise) houses based on their size and how luxurious they appear. The AI is trained on a large dataset of houses, all of them rated by human experts: the larger the house, the cleaner it is, and the more bedrooms it has, the better the rating.

But when the AI was trained, it picked up a few other patterns in the data too, patterns the human trainers weren’t aware of: the larger houses also had a lot of windows, and the houses rated “nice” contained a lot more purple than the rest. A malicious actor who identifies this can simply paint their walls purple when they try to sell their house, or leave a pile of window frames lying around, and their home will immediately skyrocket to the top of the real estate site.
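To illustrate the shortcut, here’s a hypothetical sketch in Python with scikit-learn. The data is entirely synthetic and the “purple_fraction” feature is made up for the example, but it shows how a model can latch onto a spurious signal that an attacker can then crank up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features per listing: size, bedroom count, and the fraction of
# purple pixels in the photos. The expert "niceness" score is driven by size and
# bedrooms, but in this synthetic data the nice houses also *happen* to be purple.
size = rng.uniform(50, 400, n)                          # square metres
bedrooms = rng.integers(1, 7, n)
score = 0.01 * size + 0.5 * bedrooms + rng.normal(0, 0.3, n)
purple_fraction = 0.1 * score + rng.normal(0, 0.02, n)  # spurious correlation

X = np.column_stack([size, bedrooms, purple_fraction])
model = LinearRegression().fit(X, score)
print(dict(zip(["size", "bedrooms", "purple"], model.coef_.round(3))))

# An attacker who can't enlarge their house can still crank up purple_fraction:
modest_house = np.array([[80, 2, 0.9]])   # small, 2 bedrooms, very purple walls
print(model.predict(modest_house))        # the predicted "niceness" jumps anyway
```

The model ends up giving the purple feature a large weight because, in the training data, purpleness tracks the expert score almost perfectly, and that weight is exactly what the attacker exploits.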

Adversarial Patches

One of the most widely known cases of Adversarial Patching is the so-called adversarial patch (you can see a few examples here). These are strange images that look like a jumble of rainbow patterns. To most humans they just look weird, but to an AI model they look like an entirely different object. If an adversarial patch is placed on top of a mug or a carpet, an image-detecting AI might identify both as a toaster instead.

This happens because adversarial patches take advantage of the way the models actually classify images, and then try to maximize those signals. Think of it like knowing that someone identifies cats by looking for whiskers, a fluffy tail, and pointy ears. So you create an image that contains just that: whiskers, tails, and ears. These images might look abstract to us, but to an AI they are the most cat that has ever catted; adversarial patches are the “strongest version” of the features the model is looking for.
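Here’s a minimal sketch of that “strongest version” idea, assuming PyTorch and a pretrained torchvision classifier (the class index, step count, and learning rate are just illustrative, and the usual input normalization is skipped for brevity): start from noise and push the pixels toward whatever maximizes one class score.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Nudge an image of pure noise until one class score is as high as possible.
# The result looks abstract to us, but to the model it is an extreme
# concentration of the features it associates with that class.
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 281                              # ImageNet: "tabby cat"
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(image)
    loss = -logits[0, target_class]             # maximize the target class logit
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)                      # keep pixels in the valid [0, 1] range
```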

Adversarial Patching Attacks Explained

Much like with many other attacks on AI and ML algorithms, an attacker wanting to exploit this will first have to identify which AI their target uses and what its use cases actually are. From there, the attacker will search for odd patterns the model has learned and then try to maximize them.

This can be as simple as a trial-and-error approach: an attacker notices a grading AI always gives better scores to students who are polite, so they start sprinkling meaningless “thank you”s and “please”s at the start of their papers to get a higher grade.

Something similar happened during GPT-2’s training. The AI safety coach used to grade GPT-2’s outputs always rated pleasant responses very highly, so GPT-2 began to turn all of its responses into meaningless pleasantries with no rhyme or reason. To prevent this, a second AI coach was added to verify the grammar of GPT-2’s replies, keeping them pleasant but also grammatically correct.

As a more complex example, adversarial patches take a different approach. They look for patterns in image-detection AIs and then try to maximize those patterns against a bunch of different backgrounds. If you notice that a few neurons light up whenever a model detects a dog, you can play with the image and the model until you get an image that maximizes those activations regardless of what surrounds it. This is why an adversarial patch causes an AI to mislabel a mug as a toaster: the AI does see the mug, but the toaster’s adversarial features just score that much higher!
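Here’s a rough sketch of that process, again assuming PyTorch and a pretrained torchvision model. The random tensors stand in for a loader of ordinary background photos, and real attacks add random scaling, rotation, and printability constraints on top; this only shows the core loop of optimizing one patch across many backgrounds and positions.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Train a small patch so that, wherever it is pasted onto an image, the model's
# "toaster" score wins over whatever else is in the picture.
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 859                                   # ImageNet: "toaster"
patch = torch.rand(3, 50, 50, requires_grad=True)    # the learnable patch
optimizer = torch.optim.Adam([patch], lr=0.05)

# Stand-in for real background photos (batches of ordinary images).
background_batches = [torch.rand(8, 3, 224, 224) for _ in range(10)]

def apply_patch(images, patch):
    """Paste the patch at a random location in each image of the batch."""
    patched = images.clone()
    _, _, h, w = images.shape
    ph, pw = patch.shape[1:]
    for i in range(images.shape[0]):
        y = torch.randint(0, h - ph, (1,)).item()
        x = torch.randint(0, w - pw, (1,)).item()
        patched[i, :, y:y + ph, x:x + pw] = patch
    return patched

for background_batch in background_batches:
    optimizer.zero_grad()
    logits = model(apply_patch(background_batch, patch))
    loss = -logits[:, target_class].mean()           # make "toaster" win on every background
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        patch.clamp_(0, 1)                           # keep the patch a valid image
```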

If you’re intrigued by how Adversarial Patching with adversarial patches is actually performed, read this study, particularly Chapter 2: Adversarial Patch.

Mitigating Risk from Adversarial Patching

Mitigating all the risks of Adversarial Patching is no easy task. How can you prevent that which you might not predict? Despite this, there are a few techniques that will help harden your models against this kind of attack.

  • Utilize Explainable AI: Employ explainable AI techniques to understand how the model makes its decisions. If the model behaves unexpectedly, explainability tools can help trace the behavior. This has the added benefit of making it easier to manage regulatory, compliance, and risk requirements, such as mitigating the risk of unintended bias. You can read more about it here: Explainable AI
  • Implement Data Transformation Techniques: Certain data transformation techniques exist to reduce skewness in data and models. Log transformation, Box-Cox transformation, and Yeo-Johnson transformation (among others) can help minimize the effects of skewness in data. You can read more about them and their implementation here:
  • Implement Anomaly Detection: Use statistical and machine-learning-based methods for anomaly detection and behavior analysis. Many Adversarial Patching attacks end up causing abnormal spikes in the models’ feedback data that can potentially be detected (see the sketch after this list).
  • Regular Model Audits: Regularly audit AI models, especially after updates or retraining sessions. These audits should include testing the model against a suite of adversarial examples and checking for unexpected behaviors that might indicate an Adversarial Patching attack.
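As a small illustration of the anomaly-detection point above, here’s a minimal sketch (the feedback numbers are made up) that flags a sudden spike in how often one class is predicted:

```python
import numpy as np

def flag_prediction_spike(daily_class_rates, threshold=3.0):
    """Flag the latest day if the share of predictions for a class jumps far
    above its historical mean. `daily_class_rates` is a 1-D sequence of the
    fraction of images labelled, e.g., "toaster" per day."""
    rates = np.asarray(daily_class_rates, dtype=float)
    mean, std = rates[:-1].mean(), rates[:-1].std() + 1e-9  # baseline from history
    z = (rates[-1] - mean) / std
    return z > threshold, z

# Hypothetical feedback data: "toaster" is normally ~1% of predictions, then spikes.
history = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.080]
is_anomalous, z_score = flag_prediction_spike(history)
print(is_anomalous, round(z_score, 1))   # True, with a z-score far above the threshold
```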

References