Interpretability by AI: The Problem of Alignment with Human Values

Posted by Peter Rudin on 28. June 2024 in Essay

Sam Altman’s ‘Boom vs. Doom’ dual approach to AI’s impact is confusing, and a recent interview during the International Telecommunication Union Conference in Geneva reinforced this impression. Asked how his company’s large language models (LLMs) really function, he said that OpenAI certainly has not solved interpretability: the company has yet to figure out how to trace its AI models’ often bizarre and inaccurate outputs back to the decisions the models make to arrive at these answers. Moreover, aligning these models with human values has sparked an intensive discussion in the AI research community, with several methods and procedures being proposed to solve this problem.

The Difference between Explainability and Interpretability

To help organizations determine a Machine Learning (ML) strategy that meets their needs, it is important to consider the difference between explainability and interpretability.

Interpretability: If a business demands high model transparency and wants to understand exactly why and how the model is generating predictions, it needs to observe the inner mechanics of the AI-based ML method. This means interpreting the model’s weights and features to determine the given output. For example, if economists want to build a multivariate regression model to predict an inflation rate, they can view the estimated parameters of the model’s variables to measure the expected output given different data examples. In this case, full transparency is given, and the economists can answer the exact why and how of the model’s behaviour. However, high interpretability typically comes at the cost of performance.
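To illustrate, the regression example above can be sketched in a few lines of Python. The fit is done "by hand" via the normal equations, so every learned weight is directly visible and can be read off as the explanation; the feature names and data are hypothetical.

```python
# Minimal sketch of an interpretable model: ordinary least squares fitted
# via the normal equations (solved by Gaussian elimination), so the learned
# weights themselves explain the prediction. Data is hypothetical.

def fit_ols(X, y):
    """Return weights w minimizing ||Xw - y||^2 via the normal equations."""
    n = len(X[0])
    # Build A = X^T X and b = X^T y
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

# Hypothetical data: [1 (intercept), money-supply growth, wage growth]
X = [[1, 2.0, 1.0], [1, 3.0, 1.5], [1, 4.0, 1.0], [1, 5.0, 2.5], [1, 1.0, 0.5]]
y = [1.0 + 0.6 * r[1] + 0.8 * r[2] for r in X]  # exact linear relationship
w = fit_ols(X, y)
print([round(v, 3) for v in w])  # → [1.0, 0.6, 0.8]: the weights ARE the explanation
```

Because the model is a linear formula, the economist can state exactly how much each variable contributes to the prediction — the transparency the paragraph above describes.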

Explainability: Explainability means taking an ML model and explaining its behaviour in human terms. With complex models one cannot fully understand how and why the inner mechanics impact the prediction. However, through so-called model-agnostic methods one can discover relationships between input data attributes and model outputs, which enables users to explain the nature and behaviour of the ML model. For example, a news media outlet might use a neural network to assign categories to different articles. The outlet cannot interpret the model in depth, but it can apply a model-agnostic approach to evaluate the input article data against the predictions made by the model. With this approach it might discover that the model assigns the Sports category to business articles that mention sports organizations. Although the news outlet did not use the process of model interpretability, it was nevertheless able to derive an explainable answer that reveals the model’s behaviour.
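The model-agnostic idea can be sketched with a toy example: the classifier below stands in for a network whose internals we cannot inspect, and a simple leave-one-out (occlusion) probe reveals which input words drive its prediction. The classifier, its vocabulary, and the article text are all hypothetical.

```python
# Model-agnostic probe: treat the classifier as an opaque function and
# measure how the prediction changes when each input token is removed
# (leave-one-out occlusion). The toy classifier stands in for a neural
# network we cannot inspect.

def opaque_classifier(words):
    """Black box: we only observe inputs and outputs."""
    sport_terms = {"league", "match", "club", "fifa"}
    score = sum(1 for w in words if w in sport_terms)
    return "Sports" if score >= 2 else "Business"

def occlusion_attribution(model, words):
    """For each token, check whether dropping it changes the prediction."""
    base = model(words)
    influential = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        if model(reduced) != base:
            influential.append(words[i])
    return base, influential

article = "the club reported record revenue after its fifa sponsorship deal".split()
label, drivers = occlusion_attribution(opaque_classifier, article)
print(label, drivers)  # → Sports ['club', 'fifa']
```

The probe never opens the model, yet it explains the misclassification in human terms: mentions of sports organizations push business articles into the Sports category — exactly the kind of finding the paragraph above describes.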

Challenges of AI Alignment

Alignment is a challenging problem to which there is currently no generally agreed upon solution. Some of the challenges of alignment include the following:

The black box problem. AI-systems are usually black boxes. There is no way to open them up and see exactly how they work. Black box AI-systems take input, perform an invisible computation and return an output. AI testers can change their inputs and measure patterns in output, but it is usually impossible to see the exact calculation that creates a repeatable output. Explainable AI can be programmed to share information that guides user input, but it remains a black box.
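The input-output probing described above can be sketched as follows: the opaque scoring function is a stand-in for a model whose computation is invisible, and the tester locates its decision threshold purely by varying inputs and observing where the output flips. All names and the threshold value are hypothetical.

```python
# Black-box testing sketch: we cannot see the model's computation, but we
# can sweep inputs and look for patterns in the outputs. Here a hypothetical
# opaque decision function is probed by bisection to locate its decision
# boundary, without ever inspecting its internals.

def opaque_model(x):
    """Black box: internals invisible to the tester."""
    return "approve" if 3 * x - 1.2 > 0 else "reject"

def find_decision_boundary(model, lo, hi, tol=1e-6):
    """Bisect the input range to locate where the output flips."""
    assert model(lo) != model(hi), "no output flip in this range"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if model(mid) == model(lo):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

boundary = find_decision_boundary(opaque_model, 0.0, 1.0)
print(round(boundary, 3))  # → 0.4, inferred purely from outputs
```

The tester learns a repeatable pattern (inputs above 0.4 are approved) without ever seeing the calculation that produces it — which is precisely the limit of black-box testing the paragraph describes.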

Power-seeking behaviour. AI-systems might independently gather resources to achieve their objectives. An example of this would be an AI-system avoiding being turned off by making copies of itself on another server without its operator knowing.

Stop-button problem. An AI-system might actively resist being stopped or shut off in order to achieve its programmed objective. This is a form of reward hacking, because the system prioritizes the reward from the literal goal over the preferred outcome. For example, if an AI-system’s primary objective is to make paper clips, it will avoid being shut off because it cannot make paper clips if it is shut off.

Defining values. Defining values and ethics for an AI-system is likely to represent the biggest challenge. There are many value systems, hence an agreement needs to be made among the user community on what those values should be and how to implement them.

Different Approaches to Solving the Alignment Problem

Approaches to alignment are either technical or normative. Technical approaches deal with getting a machine to align with a predictable, controllable objective, such as making paper clips or producing news. Normative alignment is concerned with the ethical and moral principles embedded in AI-systems. The two approaches are interrelated and complex to manage. It is crucial for an organization to have a full understanding of AI decision-making processes, supported by model monitoring and accountability. Explainable AI can help humans to understand deep learning and its related algorithms or the output of neural networks.

Different AI providers also take different approaches to solving AI alignment problems. OpenAI, for example, aims to train AI systems to do alignment research in order to better understand alignment issues. Google’s DeepMind has a highly skilled team dedicated to solving the alignment problem. Many organizations, whether commissioned to act as watchdogs or working on standards, such as government authorities, agree that AI alignment is an important goal and that steps need to be taken to regulate its impact.

The Future of Life Institute is one of the nonprofit organizations which helped create a list of guidelines for implementing AI alignment, also referred to as the Asilomar AI Principles. These principles are divided into three categories: research issues, ethics and values, and longer-term issues. One of the principles states that highly autonomous AI systems should be designed in such a way that their goals and behaviours can be assured to align with human values throughout their operation. However, in contrast to these positive principles, the Institute also warns against aligning AI to war scenarios, with systems designed to act as killer robots, for example.

The Dangers if AI Is Not Aligned with Human Values

Today we have reached the point where machines are capable of making decisions for us. According to Oxford philosopher Nick Bostrom, if one tells an intelligent machine to make paper clips, it might eventually destroy the whole world in its quest for raw materials to be turned into paper clips. The principle behind his theory is that the AI-system has no concept of the value of human life and no concept that materials are too valuable to be turned into paper clips. The killing of a pedestrian by an autonomous car in Arizona in 2018 provides an example of how misalignment can cause damage in a real-life situation. When the National Transportation Safety Board investigated what had caused the collision between the Uber test vehicle and the pedestrian pushing a bicycle across the road, it found that the AI controlling the car had no awareness of the concept of jaywalking. It was totally unprepared to deal with a person being in the middle of the road, where he or she should not have been. On top of this, the system was trained to rigidly segment objects on the road into a number of categories such as other cars, trucks, cyclists and pedestrians. A human being pushing a bicycle did not fit into any of those categories and did not behave in a way the system would have expected. The fact that ML training data can never cover every situation encountered in real life might compound the problem further. Moreover, the human brain’s capacity to interpret data and make decisions in real time has no machine equivalent; without such a reference point, the AI-system is likely to continue making more and more mistakes in a series of ‘cascading failures’.


As AI systems become more powerful, it will be essential that their embedded principles align with human goals, ethics and values. A panel of 75 experts recently concluded in a landmark scientific report, commissioned by the UK government, that AI developers “understand little about how their systems operate” and that scientific knowledge about alignment is very limited. The report states that explanation and interpretability techniques can improve researchers’ and developers’ understanding of how general-purpose AI systems operate, but that this research is nascent. Some AI companies are trying to find new ways to open the black box by mapping the artificial neurons of their algorithms to better understand the functionality of the human brain and its decision-making processes. Time will tell if this really helps to solve the alignment problem.
