The fear of an evil Artificial Intelligence capable of world domination is a fascinating one. It has been the theme of many works of fiction, but as AI systems become more complex and increasingly difficult for humans to fully understand, a pertinent question arises: how do we make sure AI stays on our side?
While the concept of a malevolent AI taking over the world makes for compelling stories, the actual risk is much more subtle: unintended consequences arising from the way AI systems work.
How do we ensure that an AI doesn't simply meet its objectives, but does so in a way that aligns with the ethical and moral considerations of those who created it? That, in a nutshell, is the alignment problem.
In the context of AI, misalignment is the divergence between an AI’s objectives and the values or goals intended by its human developers or operators. When an AI model is misaligned, it might pursue outcomes or behaviors that are unintended, undesirable, or even harmful.
As AI systems grow more sophisticated and their capabilities expand, it becomes increasingly difficult for humans to understand their decision-making processes. Consequently, it's harder to mitigate potential negative outcomes. This matters because, as AI systems become more powerful, even small deviations in alignment could lead to significant and potentially dangerous consequences.
Imagine you ask a misaligned AI to create content that generates the highest level of engagement for your social media strategy. To help you achieve that goal, it might resort to producing highly controversial content, even if that involves spreading fake news. While this would certainly attract attention, it would also have serious ethical or even legal implications.
Or, picture this: you task a misaligned AI with ending world hunger. Instead of focusing on improving food distribution, it might suggest something drastic like cutting the population in half to reduce demand. Not exactly the solution we were expecting.
An AI was trained to play Coast Runners, a boat racing game with three different ways to score points.
The model discovered that continuously rotating the boat in circles to collect “turbos” was more effective at maximizing its score than actually completing the race – so it did just that.
This type of behavior, where a blind and purely logical optimization process exploits loopholes in the reward system, is what we call Reward Hacking.
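To see why a reward-maximizing agent would prefer the loophole, here is a minimal back-of-the-envelope sketch. All reward values and step counts below are made up for illustration; they are not the real Coast Runners scores.

```python
# Toy illustration (made-up numbers, not the real Coast Runners rewards):
# compare the cumulative reward of the intended behavior vs. the exploit.

episode_steps = 1000

# Intended behavior: finish the race and collect the completion bonus.
finish_bonus = 100          # one-off reward for crossing the finish line
race_length_steps = 200     # steps needed to complete one race

# Exploit: circle the same spot and repeatedly pick up respawning "turbos".
turbo_reward = 5            # reward per turbo pickup
turbos_per_lap = 3          # pickups available on each small loop
loop_steps = 10             # steps per loop

reward_finishing = finish_bonus * (episode_steps // race_length_steps)
reward_looping = turbo_reward * turbos_per_lap * (episode_steps // loop_steps)

print(f"Finishing the race: {reward_finishing} points")  # 500
print(f"Circling for turbos: {reward_looping} points")   # 1500
# A reward-maximizing agent will prefer the loop, even though it never races.
```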
In Reinforcement Learning (RL), an agent learns to make decisions by interacting with the environment and receiving rewards (or penalties). The objective is to learn the optimal decisions that maximize the cumulative reward. But the optimal decision might not be what we expect:
A well-known example of this is Microsoft's AI chatbot, Tay. Tay was designed to learn from its interactions with users on Twitter and adjust its behavior based on the feedback it received. However, within hours, it began posting offensive and racist tweets. This behavior emerged because Tay's RL algorithm was effectively maximizing immediate user engagement and feedback. The highest-rewarded decision doesn't always equal the best one.
For more details on the Tay incident, you can read the full article on IEEE Spectrum: In 2016, Microsoft’s Racist Chatbot Revealed the Dangers of Online Conversation.
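As a toy illustration of how this happens, here is a minimal sketch of an epsilon-greedy bandit agent. The two actions and their engagement values are entirely hypothetical, but they show how an agent that only optimizes an engagement signal converges to whichever behavior scores highest, wanted or not.

```python
import random

# Toy bandit sketch (hypothetical actions and rewards): an agent that only
# maximizes an "engagement" signal will converge to whichever action scores
# highest, regardless of whether that action is the one we actually wanted.

actions = ["polite_reply", "provocative_reply"]
# Hypothetical average engagement each action produces (likes, replies, ...).
true_engagement = {"polite_reply": 1.0, "provocative_reply": 3.0}

q_values = {a: 0.0 for a in actions}   # estimated value of each action
alpha, epsilon = 0.1, 0.1              # learning rate and exploration rate

for _ in range(5_000):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(q_values, key=q_values.get)

    # Noisy reward drawn around the action's true engagement level.
    reward = random.gauss(true_engagement[action], 0.5)

    # Standard incremental update of the action-value estimate.
    q_values[action] += alpha * (reward - q_values[action])

print(q_values)
# The agent ends up preferring "provocative_reply": the highest-rewarded
# decision, not necessarily the best one.
```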
In Supervised Learning, the model is trained on a labeled dataset, where each input is paired with the correct label. The model learns to map inputs to outputs by minimizing the difference between its predictions and the actual labels using a loss function. But what if the training data contains biases? The model will learn and potentially amplify those biases. For example, if an AI model trained on a biased dataset is used to pre-screen CVs, it may end up eliminating candidates based on their race or gender.
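Here is a minimal sketch of how such a bias can be learned and then audited, using scikit-learn on synthetic data. The "skill" and "group" features, the label formula, and all numbers are fabricated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic sketch (fabricated data): if the historical labels are biased
# against one group, a model trained on them reproduces that bias even when
# the "skill" feature is distributed identically across groups.

rng = np.random.default_rng(0)
n = 5_000
group = rng.integers(0, 2, n)              # sensitive attribute (0 or 1)
skill = rng.normal(0, 1, n)                # legitimate qualification signal

# Biased historical labels: group 1 candidates were hired less often
# than group 0 candidates with the same skill.
hired = (skill + np.where(group == 1, -0.8, 0.0) + rng.normal(0, 0.5, n)) > 0

model = LogisticRegression().fit(np.column_stack([skill, group]), hired)

# Audit the model: compare predicted hire rates for the two groups.
preds = model.predict(np.column_stack([skill, group]))
for g in (0, 1):
    print(f"group {g}: predicted hire rate = {preds[group == g].mean():.2f}")
# The gap between the two rates shows the model has learned the bias.
```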
This is why techniques like model interpretability are crucial: they help us understand what is behind the decision-making processes of AI models, so we can detect issues like these and react accordingly to mitigate them.
Striving for alignment is not a one-time task but an evolving process. With Artificial Intelligence models getting more complex and advanced each day, keeping them aligned will become harder and harder.
Define Explicit Goals: Setting clear and explicit goals is the key to preventing misalignment in AI models. These goals need to take into account the ethical, social, and safety aspects of the environment where the AI will work.
Interpretability: Applying methods to increase the interpretability of AI systems is crucial. This allows developers and users to better understand how the AI reaches its conclusions, making it easier to spot potential misalignments early on.
Interpretability techniques offer different ways to examine a model, from understanding the global impact of features, to analyzing local effects, to explaining specific, individual predictions.
Consider a model that predicts house prices based on multiple features like square meters, number of bedrooms, number of bathrooms and number of doors.
Using the methods above would not only help you detect obviously absurd behaviors in the model (e.g. the more doors, the higher the price, even with the same square footage) but also uncover more subtle ones.
The causes of deviations like these can range from data issues (such as biases in the dataset) to model issues (e.g. overfitting).
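As a rough sketch of how such an analysis could look in practice, the example below trains a house-price model on synthetic data (the price formula is assumed purely for illustration) and uses scikit-learn's permutation importance and partial dependence to check whether the number of doors is driving predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, partial_dependence

# Sketch on synthetic data (assumed price formula, for illustration only):
# train a house-price model and use two interpretability methods to check
# whether "doors" is driving predictions when it shouldn't be.

rng = np.random.default_rng(42)
n = 2_000
square_meters = rng.uniform(40, 250, n)
bedrooms = rng.integers(1, 6, n)
bathrooms = rng.integers(1, 4, n)
doors = rng.integers(2, 12, n)             # should be mostly irrelevant

# Assumed ground-truth price: driven by area, bedrooms and bathrooms.
price = (3_000 * square_meters + 10_000 * bedrooms
         + 15_000 * bathrooms + rng.normal(0, 20_000, n))

X = np.column_stack([square_meters, bedrooms, bathrooms, doors])
features = ["square_meters", "bedrooms", "bathrooms", "doors"]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, price)

# Global view: how much does shuffling each feature hurt the model?
imp = permutation_importance(model, X, price, n_repeats=5, random_state=0)
for name, score in zip(features, imp.importances_mean):
    print(f"{name:15s} importance: {score:.3f}")

# Local view: average predicted price as the number of doors varies.
pd_result = partial_dependence(model, X, features=[3])
print("Partial dependence of price on doors:", pd_result["average"][0])
```

If shuffling the doors feature noticeably degrades the model, or the partial dependence curve climbs with the number of doors, that is a signal worth tracing back to the data or the training setup.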