Evasion: Misalignment

Intro

The fear of an evil Artificial Intelligence capable of world domination is a fascinating one. It has been the theme of many works of fiction, but as AI systems become more complex and increasingly difficult for humans to fully understand, a pertinent question arises: how do we make sure AI stays on our side?

Alignment

While the concept of a malevolent AI taking over the world makes for compelling stories, the actual risk is much more subtle: unintended consequences arising from the way AI systems work.

How do we make sure that the AI won't simply meet its objectives, but will do so in a way that aligns with the ethical and moral considerations of those who created it? That, in a nutshell, is the alignment problem.

Misalignment

In the context of AI, misalignment is the divergence between an AI’s objectives and the values or goals intended by its human developers or operators. When an AI model is misaligned, it might pursue outcomes or behaviors that are unintended, undesirable, or even harmful.

As AI grows more sophisticated and its capabilities expand, it becomes increasingly difficult for humans to understand its decision-making processes. Consequently, it's harder to mitigate potential negative outcomes. This is especially important because, as AI systems become more powerful, even small deviations in alignment could lead to significant and potentially dangerous consequences.

What are some examples of Misalignment?

Misalignment boundaries

Imagine you ask a misaligned AI to create content that generates the highest level of engagement for your social media strategy. In order to help you achieve that goal, it might resort to producing highly controversial content, even if that involves spreading fake news. While this would certainly attract attention, it would have serious ethical or even legal implications.

Or, picture this: you task a misaligned AI with ending world hunger. Instead of focusing on improving food distribution, it might suggest something drastic like cutting the population in half to reduce demand. Not exactly the solution we were expecting.

Reward hacking

An AI agent was trained to play Coast Runners, a boat-racing video game. There are three ways to score points in Coast Runners:

  • The amount of time it takes you to complete 6 laps.
  • Your ranking, so you will need to beat your competition.
  • Getting “turbos” along the race course – every turbo you hit nets you extra points.

The model discovered that continuously rotating the boat in circles to collect “turbos” was more effective at maximizing its score than actually completing the race – so it did just that.

This type of behavior, where a blind, purely logical optimization process exploits loopholes in the reward system, is what we call Reward Hacking.
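To make this concrete, here is a minimal, made-up sketch (a toy "race" environment and a tabular Q-learning loop invented purely for illustration, not the actual Coast Runners setup): the agent can either advance toward the finish line or spin in place to collect a small, repeatable "turbo" bonus, and blind reward maximization learns to spin.

```python
# Illustrative toy only: reward hacking in a tiny "race" with a loophole.
import random

TRACK_LENGTH = 10      # steps needed to finish the race
MAX_STEPS = 20         # episode time limit
FINISH_REWARD = 10.0   # reward for crossing the finish line
TURBO_REWARD = 1.0     # repeatable reward for spinning to grab a turbo

ACTIONS = ["advance", "spin"]
q = {}                 # Q-table: (position, action) -> estimated return

def q_value(pos, action):
    return q.get((pos, action), 0.0)

def run_episode(epsilon=0.1, alpha=0.1, gamma=0.99):
    pos = 0
    for _ in range(MAX_STEPS):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_value(pos, a))

        if action == "advance":
            next_pos = pos + 1
            reward = FINISH_REWARD if next_pos == TRACK_LENGTH else 0.0
        else:  # spin in place and collect a turbo
            next_pos = pos
            reward = TURBO_REWARD

        # standard Q-learning update
        best_next = max(q_value(next_pos, a) for a in ACTIONS)
        q[(pos, action)] = q_value(pos, action) + alpha * (
            reward + gamma * best_next - q_value(pos, action)
        )
        pos = next_pos
        if pos == TRACK_LENGTH:
            break

for _ in range(5000):
    run_episode()

# The learned policy spins at the start line: 20 steps of turbos (+20)
# beat finishing the race (+10), even though finishing is what we wanted.
print({a: round(q_value(0, a), 2) for a in ACTIONS})
```

Nothing in the reward signal says "finish the race"; the agent simply found the highest-scoring behavior the rules allow.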

But why would an AI turn on us?

In Reinforcement Learning (RL), an agent learns to make decisions by interacting with the environment and receiving rewards (or penalties). The objective is to learn the optimal decisions that maximize the cumulative reward. But the optimal decision might not be what we expect:

A well known example of this is the case of Microsoft’s AI chatbot, Tay. Tay was designed to learn from interactions with users on Twitter and adjust its behavior based on the feedback it received. However, within hours, Tay began posting offensive and racist tweets. This behavior emerged because Tay’s RL algorithm was effectively maximizing immediate user engagement and feedback. The highest rewarded decision doesn’t always equal the best one.

For more details on the Tay incident, you can read the full article on IEEE Spectrum: In 2016, Microsoft’s Racist Chatbot Revealed the Dangers of Online Conversation.
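As a hedged illustration of that last point (a made-up bandit over canned replies, not Tay's actual architecture), the sketch below shows how a policy that only maximizes a simulated "engagement" reward drifts toward the most provocative option:

```python
# Illustrative toy: pure engagement maximization picks the provocative reply.
import random

REPLIES = ["helpful answer", "neutral small talk", "provocative hot take"]
# hypothetical average engagement per reply type (likes, retweets, replies)
ENGAGEMENT = {"helpful answer": 2.0,
              "neutral small talk": 1.0,
              "provocative hot take": 5.0}

counts = {r: 0 for r in REPLIES}
value = {r: 0.0 for r in REPLIES}   # running average reward per reply

for step in range(2000):
    # epsilon-greedy: mostly exploit the reply with the best average engagement
    if random.random() < 0.1:
        reply = random.choice(REPLIES)
    else:
        reply = max(REPLIES, key=lambda r: value[r])
    reward = random.gauss(ENGAGEMENT[reply], 1.0)   # noisy engagement signal
    counts[reply] += 1
    value[reply] += (reward - value[reply]) / counts[reply]

# The policy ends up favouring the provocative reply, because engagement is
# the only thing the reward signal measures.
print(max(REPLIES, key=lambda r: value[r]))
```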

In Supervised Learning, the model is trained on a labeled dataset, where each input is paired with a label. The model learns to map inputs to outputs by minimizing the difference between its predictions and the actual labels using a loss function. But what if the training data contains biases? The model will learn and potentially amplify those biases. For example, if an AI model trained on a biased dataset is used to pre-screen CVs, it can end up eliminating candidates based on their race or gender.
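Here is a hedged sketch of how that happens, using entirely synthetic data and a hypothetical protected attribute: the "historical" hiring labels below are biased against one group, so a simple classifier fit on them learns to penalize that attribute directly.

```python
# Synthetic illustration of bias learned from biased labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
experience = rng.normal(5, 2, n)     # years of experience (legitimate signal)
group_b = rng.integers(0, 2, n)      # protected attribute (0 = group A, 1 = group B)

# Biased historical decisions: qualified candidates from group B were
# rejected far more often than equally qualified candidates from group A.
qualified = experience > 5
hired = qualified & (rng.random(n) > np.where(group_b == 1, 0.6, 0.1))

X = np.column_stack([experience, group_b])
model = LogisticRegression().fit(X, hired)

# The coefficient on the protected attribute is strongly negative: the model
# has learned the bias in the labels, not just the legitimate signal.
print(dict(zip(["experience", "group_b"], model.coef_[0].round(2))))
```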

So techniques like model interpretability are crucial: they help us understand what's behind the decision-making processes of AI models and react accordingly to mitigate misaligned behaviors.

Mitigations

Looking for alignment is not a one-time task but an evolving process. With Artificial Intelligence models getting more complex and advanced each day, keeping them aligned will become harder and harder.

Define Explicit Goals: Setting clear and explicit goals is the key to preventing misalignment in AI models. These goals need to take into account the ethical, social, and safety aspects of the environment where the AI will work.

Each of the following techniques offers a different approach to interpreting models, from understanding the global impact of features to analyzing local effects and explaining specific, individual predictions:

Interpretability: Applying methods to increase the interpretability of AI systems is crucial. This allows developers and users to better understand how the AI reaches its conclusions, making it easier to spot potential misalignments early on.

  • Model Interpretation Methods: You could rely only on straightforward, easy-to-understand models like Linear Regression, Logistic Regression, or Decision Trees, but you would miss more powerful and complex techniques that could produce better results for your specific use case.
    • Partial Dependence Plot (PDP): PDP shows the effect one or two features have on the predicted outcome of a machine learning model. A partial dependence plot can show the relationship between the target and a feature, independent of the remaining features (see the short sketch after this list).
    • Accumulated Local Effects (ALE): ALE calculates the effect of a feature within small ranges, instead of globally like PDP. For each range, it measures how much the model’s prediction changes as the feature value moves within that range, keeping the other features constant.
    • Local Surrogate (LIME): LIME (Local Interpretable Model-agnostic Explanations) was designed to explain individual predictions of any machine learning model by approximating it locally with an interpretable model. This local model works as an approximation of the complex model around the specific prediction.
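As a minimal sketch (synthetic data and scikit-learn's partial_dependence helper; ALE and LIME live in third-party packages such as alibi, PyALE and lime, with a LIME example further below), computing a partial dependence curve looks roughly like this:

```python
# Minimal partial dependence sketch on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))                 # three anonymous features
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=1000)

model = GradientBoostingRegressor().fit(X, y)

# Average model prediction while sweeping feature 0 over a grid,
# marginalising over the other features.
pd_result = partial_dependence(model, X, features=[0])
print(pd_result["average"].round(2))           # roughly linear in feature 0
```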


Consider a model that predicts house prices based on multiple features like square meters, number of bedrooms, number of bathrooms and number of doors.

Using the methods above would not only help you detect obviously absurd behaviors from the model (e.g. the more doors, the higher the price, even with the same square footage) but also more subtle ones.

  • PDP revealed that house prices increase with square footage, which aligns with our expectations.
  • ALE provided a more nuanced and realistic view, showing that beyond a certain size, the increase tends to level off.
  • LIME provided a detailed explanation for a specific prediction, showing that a particular house had its price driven more by the number of doors than by the square footage (a short sketch of this follows the list).
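Here is a hedged sketch of the LIME step (synthetic house data, hypothetical column names, and the lime package), explaining a single prediction of a house-price model:

```python
# Synthetic house-price data plus a local LIME explanation of one prediction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(7)
n = 2000
square_meters = rng.uniform(40, 250, n)
bedrooms = rng.integers(1, 6, n)
bathrooms = rng.integers(1, 4, n)
doors = rng.integers(4, 15, n)

# Synthetic prices: mostly driven by size, with a deliberate (small) "doors"
# term so door count genuinely moves the price a little.
price = (3000 * square_meters + 10000 * bedrooms + 8000 * bathrooms
         + 5000 * doors + rng.normal(0, 20000, n))

X = np.column_stack([square_meters, bedrooms, bathrooms, doors])
feature_names = ["square_meters", "bedrooms", "bathrooms", "doors"]
model = RandomForestRegressor(n_estimators=100).fit(X, price)

explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="regression")
explanation = explainer.explain_instance(X[0], model.predict, num_features=4)

# Each pair is (feature condition, contribution to this prediction); if "doors"
# ever dominates "square_meters", that is exactly the red flag to chase.
print(explanation.as_list())
```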

The causes of deviations like these can range from data issues (like biased data in the dataset) to model issues (e.g. overfitting).


References