
Brian Christian

The Alignment Problem: Machine Learning and Human Values

Nonfiction | Book | Adult | Published in 2020


Part 3: Chapter Summaries & Analyses

Part 3: “Normativity”

Part 3, Chapter 7 Summary: “Imitation”

Chapter 7 starts with a discussion of mimicry in primates. Christian notes that, historically, primates have been considered skilled imitators, a notion supported by 19th-century scientists such as George John Romanes. However, contemporary research challenges this view. Studies by Elisabetta Visalberghi and others have found little evidence of spontaneous imitation among non-human primates, except under human training, suggesting that, contrary to traditional views, humans are the main imitators. The innate ability to imitate plays a crucial role in learning and forming social connections from an early age.

Christian explores the surprising tendency of humans to over-imitate, a phenomenon where individuals replicate both necessary and unnecessary actions observed in others. Researchers note that this behavior, more common in humans than in chimpanzees, contradicts expectations as chimpanzees tend to ignore irrelevant actions. Studies revealed that even when human children could identify unnecessary steps in a task, they replicated them anyway. This pattern persists despite explicit instructions to avoid redundant actions, indicating that over-imitation may stem from a sophisticated judgment of the demonstrator’s intentions and how their mind works.

Citing contemporary research and experiments, Christian notes that imitation in humans offers three key advantages over other learning methods like trial and error or direct instruction. It enhances efficiency by leveraging others’ experiences, ensures safety by minimizing risky failures, and facilitates learning complex actions that are difficult to verbally describe. This method has profoundly influenced AI development, demonstrating that watching and replicating can be a powerful learning tool.

Christian turns to the example of Chuck Thorpe, a robotics doctoral graduate from Carnegie Mellon, who was recruited in 1984 to develop autonomous vehicles for the Defense Advanced Research Projects Agency (DARPA). Early autonomous technology was rudimentary and slow, requiring extensive computational support and repeated revisions to become more mobile. This work led to the development of the Navlab 1 and, later, the Autonomous Land Vehicle in a Neural Network (ALVINN), a neural network-based system that learned to drive by imitating human input and significantly advanced autonomous driving technology.

Twenty-five years later, in 2009, Stéphane Ross, a Carnegie Mellon student, was training a neural network using the racing game SuperTuxKart to learn driving behaviors by imitating his own play. Despite recording extensive gameplay, the system struggled with unpredicted situations not covered in the training data. Such failures in imitation learning are known as “cascading errors”: Because the system is trained only on successful maneuvers, it cannot adapt to or recover from its own mistakes. Ross’s experiments led to the development of interactive training methods, which allow the neural network to learn from both successes and mistakes. Other research teams have since trained neural networks on datasets from different environments, such as trail hiking and busy city street driving.
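
The interactive idea can be sketched in a few lines of Python. The sketch below is only an illustration of the loop that grew out of Ross’s work (similar in spirit to the DAgger algorithm); the toy track, the expert policy, and the memorizing learner are invented stand-ins, not a real driving setup.

```python
import random

class ToyTrack:
    """Stand-in environment: states are just integer positions on a loop."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):
        self.pos = (self.pos + 1) % 20
        return self.pos

def expert_policy(state):
    # Placeholder for the human demonstrator (e.g., Ross steering in SuperTuxKart).
    return "straight" if state % 5 else "left"

class Learner:
    """Toy policy that memorizes the expert's labeled action for each state."""
    def __init__(self):
        self.labels = {}
    def train(self, dataset):
        self.labels = dict(dataset)
    def act(self, state):
        # Unlabeled states get a random guess, which is how cascading errors start.
        return self.labels.get(state, random.choice(["left", "straight", "right"]))

def rollout(policy, env, steps=50):
    """Run the learner's own policy and record the states it actually visits."""
    state, visited = env.reset(), []
    for _ in range(steps):
        visited.append(state)
        state = env.step(policy.act(state))
    return visited

def interactive_training(rounds=5):
    env, learner, dataset = ToyTrack(), Learner(), []
    for _ in range(rounds):
        # 1. Let the current learner drive itself into whatever states it reaches.
        states = rollout(learner, env)
        # 2. Ask the expert what it would have done in those states, so the
        #    learner also sees how to recover from its own mistakes.
        dataset += [(s, expert_policy(s)) for s in states]
        # 3. Retrain on the aggregated dataset and repeat.
        learner.train(dataset)
    return learner
```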

Christian discusses the ramifications of imitation as a learning strategy. Imitation, he notes, can lead to failure when the imitator lacks the expertise to complete the actions they replicate. Such an idea is exemplified by Garry Kasparov, who argues that chess players who memorize moves without understanding them become stuck once their memorized repertoire runs out. This issue reflects a broader ethical and philosophical debate about whether actions should be judged based on potential best outcomes (called possibilism) or realistic expectations of outcomes (called actualism). The challenge highlights the limits of imitation, suggesting that while it can be a starting point for learning, it must be coupled with a deeper understanding and adaptation.

Another challenge of imitation-based learning is that the student struggles to surpass the teacher, as observed by machine learning pioneer Arthur Samuel, who was working for IBM in 1959. Despite developing a checkers-playing system that could beat him using only the strategies he had programmed, Samuel recognized the system’s incapacity to innovate beyond his teachings. This realization highlighted a significant obstacle: To improve, the system would need to generate its own strategic insights, a capability that was unattainable at the time. Consequently, the evolution of machine learning has consistently grappled with this dependency on human input. In 2015, AlphaGo, a learning system developed by the company DeepMind, was initially trained on records of expert human games, much as Samuel’s program had learned from human strategies, yet it came to perform in ways that significantly surpassed previous models, eventually defeating world-class players just as IBM’s Deep Blue (the chess system that famously beat Garry Kasparov) had done in chess. The subsequent model, AlphaGo Zero, was no longer fed human game data at all. Nevertheless, it was even more successful, learning entirely from games played against itself and thereby transcending the limits of its human teachers.

Researchers in philosophy and computer science are concerned with two main issues regarding the creation of advanced autonomous systems. First, expressing complex human behaviors and values in a programmable format is extremely difficult; as Nick Bostrom, a researcher at the Future of Humanity Institute, notes, it is almost impossible to list everything humans value, so an AI cannot be taught all of it through imitation alone. Second, relying solely on humans as models limits such systems, since humans may not be the best sources of moral authority or able to demonstrate desired outcomes effectively, according to Blaise Agüera y Arcas and other researchers. Christian argues that techniques like imitation learning may initially guide systems but need refinement to handle real-world complexities and moral judgments without clear external success metrics.

Part 3, Chapter 8 Summary: “Inference”

Chapter 8 starts with a 2006 study by psychologist Felix Warneken, now at the University of Michigan, which demonstrated that toddlers intuitively recognize when another person needs help and provide it without direct encouragement or reward, a behavior that suggests a deep-seated propensity for cooperation and altruism. This insight, explored alongside Michael Tomasello, contrasts with the behavior of other primates, whose helping is conditional and limited. Their findings underscore human uniqueness in social cognition, emphasizing humans’ natural inclination to collaborate and assist, an aspect that AI development might emulate through observing and learning from human actions.

In the 1990s, Stuart Russell, a UC Berkeley researcher, reflected on the uniformity of human walking patterns across different cultures and times. He suggested that the reason for this uniformity might not be simply imitation but a physical optimization of some sort, although the exact reason remained elusive. His inquiry led to the concept of inverse reinforcement learning, proposing a method to deduce the motivations behind observed behaviors, an approach that could advance AI development significantly by understanding and aligning machine actions with human intentions.

Inverse reinforcement learning (IRL) asks what reward an agent must be pursuing, given its observed behavior, rather than assuming the reward is known in advance. Stuart Russell and Andrew Ng demonstrated IRL’s potential by modeling simple tasks and assuming perfect, non-random actions in order to infer straightforward goals. The concept later evolved to model complex behaviors, like driving, where AI inferred human-like goals from observed actions without needing explicit instructions. Further advances enabled robots to perform intricate tasks by inferring intentions from demonstrations, suggesting that IRL can potentially teach machines to understand and replicate human values and complex behaviors beyond basic programmed instructions.
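
The core question behind IRL can be made concrete with a toy example. The sketch below is an illustration only, not Ng and Russell’s actual formulation: it takes a five-state corridor in which an expert is always observed walking right, then searches a small, invented space of candidate rewards for one under which that behavior would be optimal.

```python
# Illustrative-only sketch of the IRL question: which reward function would
# make the observed behavior optimal? The corridor environment and the
# hypothesis space of rewards are invented for this example.

STATES, ACTIONS, GAMMA = range(5), ("left", "right"), 0.9

def step(s, a):
    """Deterministic transitions in a 5-state corridor."""
    return max(s - 1, 0) if a == "left" else min(s + 1, 4)

def optimal_policy(reward):
    """Value iteration, then the greedy policy for a given reward vector."""
    V = [0.0] * 5
    for _ in range(100):
        V = [max(reward[step(s, a)] + GAMMA * V[step(s, a)] for a in ACTIONS)
             for s in STATES]
    return {s: max(ACTIONS, key=lambda a: reward[step(s, a)] + GAMMA * V[step(s, a)])
            for s in STATES}

# Observed demonstrations: the expert always heads right.
expert = {s: "right" for s in STATES}

# Candidate rewards: +1 placed on exactly one state (a crude hypothesis space).
consistent = []
for goal in STATES:
    reward = [1.0 if s == goal else 0.0 for s in STATES]
    if all(optimal_policy(reward)[s] == expert[s] for s in STATES):
        consistent.append(goal)

print("Reward placements consistent with the expert's behavior:", consistent)  # [4]
```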

IRL infers complex goals by observing expert demonstrations, although it depends on experts being available. The advantages of this technique are most significant in areas that are difficult to program explicitly, such as piloting helicopters through complex stunts or driving taxis, where a human expert’s behavior provides the necessary training data. The remaining challenge is whether systems can deduce reward functions from feedback alone, without explicit demonstrations, which could revolutionize how machines align with human intentions when direct programming of complex tasks is unfeasible. Researchers including Jan Leike, Paul Christiano, and Dario Amodei tested this idea in a virtual environment: Instead of demonstrating the task, humans were shown pairs of short video clips of the agent’s behavior and asked to select the better one, and the system learned a reward function from those choices, aligning its behavior with human judgments. The experiment extended to complex scenarios like the video game Enduro, where the system outperformed traditional methods.
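
The preference-based setup can be sketched roughly as follows. The clip features, the simulated human judge, and the simple logistic (Bradley-Terry style) update are invented for illustration and are not the researchers’ actual implementation.

```python
import math, random

def reward(weights, clip_features):
    """Linear reward model: score a clip by a weighted sum of its features."""
    return sum(w * f for w, f in zip(weights, clip_features))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """
    comparisons: list of (preferred_clip, other_clip) feature-vector pairs,
    as chosen by a human judge. Fit weights so the preferred clip scores
    higher (a logistic / Bradley-Terry model of the human's choices).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in comparisons:
            # Probability the model assigns to the human's actual choice.
            p = 1.0 / (1.0 + math.exp(reward(w, worse) - reward(w, better)))
            # Gradient ascent on the log-likelihood of that choice.
            for i in range(n_features):
                w[i] += lr * (1.0 - p) * (better[i] - worse[i])
    return w

# Toy data: clips described by two features, e.g. (distance travelled, crashes).
# The hypothetical human judge prefers more distance and fewer crashes.
random.seed(0)
clips = [(random.random(), random.random()) for _ in range(50)]
comparisons = []
for a, b in zip(clips[::2], clips[1::2]):
    human_prefers_a = (a[0] - a[1]) > (b[0] - b[1])
    comparisons.append((a, b) if human_prefers_a else (b, a))

weights = train_reward_model(comparisons, n_features=2)
print("Learned reward weights:", weights)  # roughly positive on distance, negative on crashes
```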

In 2013, Dylan Hadfield-Menell joined Stuart Russell’s research lab at UC Berkeley to explore the ethical alignment of AI. They worked on developing cooperative inverse reinforcement learning (CIRL) to ensure AI systems align with human values, a shift toward collaborative human-AI interaction that built on Russell’s earlier work in inverse reinforcement learning. This approach sought to redefine AI development to prioritize human objectives.

Russell’s research lab, together with others working in the field, such as roboticist Anca Drăgan, introduced advanced cooperative frameworks in machine learning by borrowing insights from developmental psychology, education, and human-computer interaction. This approach recognizes that both humans and machines can benefit from understanding each other’s intentions, improving interaction. The methodology emphasizes mutual learning and adjustment, similar to pedagogical techniques used in human parenting and teaching.

To close Chapter 8, Christian discusses the integration of machine learning with human behavioral insights from fields like developmental psychology and education to enhance cooperation between humans and machines. This cooperative framework allows machines to interpret and learn from human actions and feedback more effectively. As machines become more embedded in daily life, understanding and shaping their behavior becomes crucial. Christian advises readers to be more mindful of their actions online: Because AI systems constantly learn from behaviors such as browsing, readers should consider which of their own behaviors they want reinforced.

Part 3, Chapter 9 Summary: “Uncertainty”

Chapter 9 starts with the account of Stanislav Petrov, a Soviet officer supervising Oko, the Soviet satellite early-warning system, and the crisis he faced in 1983 when satellite warnings falsely indicated a US missile attack. Despite protocol urging immediate action, Petrov, who suspected a system error because of the improbably small scale of the reported attack, chose to report it as a false alarm. His intuition was correct: The alert was a malfunction caused by sunlight reflections, and his decision averted a potential retaliatory nuclear strike.

Just as warning systems like Oko proved less reliable than assumed, deep-learning systems have revealed flaws of their own, such as identifying random static as objects with high confidence. Trained only on defined categories, these systems fail to handle images that do not fall into any of them, pointing to a significant challenge in AI known as the “open category problem” (280). This issue was exemplified in a project led by Thomas Dietterich, in which a system trained to identify specific insect species misclassified black-and-white images of non-insect objects, since the lack of color confounded its identification process.

Christian notes that contemporary computer vision systems are highly specialized but suffer from a major limitation: They are trained to recognize only predefined categories, leading them to misclassify or overconfidently identify unfamiliar images. Lacking a “none of the above” option, they often make incorrect classifications with high confidence (281). Yarin Gal, an Oxford professor, leader of the Oxford Applied and Theoretical Machine Learning Group, and NASA scholar, emphasizes the importance of incorporating uncertainty into machine learning models. Researchers argue for the use of Bayesian principles to better manage and represent uncertainty. This approach could transform systems by allowing them to acknowledge when they do not recognize an input, enhancing reliability in practical applications like medicine and autonomous driving.
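
The practical upshot, letting a model say “none of the above” when its own predictions are unstable, can be illustrated with a toy sketch. The stand-in classifier below is invented; real systems would obtain the spread of predictions from something like Monte Carlo dropout or an ensemble.

```python
import random
from collections import Counter

LABELS = ["cat", "dog", "truck"]

def predict_once(image):
    """Placeholder for one stochastic forward pass of a classifier."""
    if image == "clear photo of a cat":
        # Familiar input: the sampled predictions almost always agree.
        return "cat" if random.random() < 0.95 else "dog"
    # Unfamiliar input (random static, an unseen category): samples scatter.
    return random.choice(LABELS)

def predict_with_uncertainty(image, samples=50, min_agreement=0.8):
    """Return a label only if the sampled predictions mostly agree."""
    votes = Counter(predict_once(image) for _ in range(samples))
    label, count = votes.most_common(1)[0]
    confidence = count / samples
    if confidence < min_agreement:
        return "none of the above", confidence  # the abstain option
    return label, confidence

random.seed(1)
print(predict_with_uncertainty("clear photo of a cat"))  # ('cat', ~0.95)
print(predict_with_uncertainty("random static"))         # likely abstains
```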

In 2017, an unidentified, unconscious man was brought into Jackson Memorial Hospital in Miami with a “Do Not Resuscitate” tattoo on his chest. Initially, Dr. Gregory Holt considered ignoring the tattoo due to the uncertainties it presented. As the man’s condition declined, ethical dilemmas intensified, leading to consultation with an ethics committee. They decided to honor the tattoo’s directive after verifying it mirrored an official request on file, and ultimately chose not to intervene further. The patient died, raising complex questions about courses of action that could be irreversible. AI safety research faces similar challenges in defining and managing the consequences of high-impact actions. Researchers like Stuart Armstrong and Victoria Krakovna are exploring how AI systems can avoid irreversible outcomes, even in trivial actions, by developing frameworks that prioritize minimizing potential negative impacts in uncertain situations.

In a 1960 article, MIT researcher Norbert Wiener articulated a foundational concept in AI safety, emphasizing the necessity of precise intent in machine programming. Wiener warned that once machines are activated, their operations become difficult to alter, making it crucial that their programmed purposes truly align with our intentions, not merely approximate them. This early insight into the alignment problem expresses a complex challenge: ensuring that AI’s actions strictly adhere to its human-defined goals. It also introduces the concept of “corrigibility,” the ability to correct or modify an AI’s course of action when necessary. Researchers at the Machine Intelligence Research Institute and the Future of Humanity Institute have studied corrigibility by focusing on incentives, experimenting with ways to balance an AI’s incentives so that it allows its goals to be modified; they found early strategies unsatisfactory but instructive for future work. They suggested that embracing uncertainty might be more effective than manipulating incentives, proposing systems that remain aware of their own potential flaws and incompleteness. This idea aligns with concurrent research from Berkeley advocating for AI systems that prioritize human input.

Christian ends Chapter 9 with a comparison between Catholic theological debates and modern AI ethical considerations, pointing to an overall uncertainty in defining moral actions in both the Catholic and the machine contexts. The author draws an analogy between Catholic interpretations of sin, which struggle to balance strict adherence with more permissive attitudes, and modern dilemmas in AI over handling conflicting ethical theories. Just as theologians debated the sinfulness of actions like eating meat on Fridays, AI ethicists debate the parameters for machine actions under moral uncertainty. This historical context emphasizes the ongoing challenge of achieving ethical consensus, whether in religious doctrine or AI programming, and suggests that the difficulty of guiding action is rooted in the uncertainty of our moral standards.

Part 3 Analysis

Chapters 7 to 9 focus on the concepts of imitation, inference, and uncertainty, three crucial areas that illustrate the complexities of aligning AI systems with human intentions and ethical norms. Christian draws on research spanning decades, with concrete effects on contemporary society.

Chapter 7 discusses the role of imitation in AI learning processes, drawing parallels with human learning behaviors, pointing to Christian’s thematic interest in The Intersection of Human and Machine Learning. Imitation, as a fundamental strategy in AI, serves as a double-edged sword. On one hand, it allows AI systems to learn efficiently from human examples, capturing complex behaviors that might be hard to program directly. This approach has been particularly influential in developing technologies such as autonomous vehicles, where AI systems learn to drive by mimicking human behavior.

Christian cites both concrete and theoretical examples to demonstrate the significant limitations introduced by machine imitation of human behavior, primarily due to the quality of the examples followed. If AI systems strictly imitate their human trainers, they inherit not only their skills but also their flaws. Furthermore, imitation restricts an AI’s ability to surpass its teachers, as it relies on existing knowledge rather than creating novel solutions. This challenge was notably highlighted by the early AI systems that played checkers or chess, which could not innovate beyond the strategies programmed by their creators, despite being successful at defeating the highest human performers in the field, such as Garry Kasparov in chess.

Christian provides an overview of the documented human tendency toward over-imitation to illustrate that AI might not only adopt inefficient behaviors but could also perpetuate them. This issue raises concerns about the depth of understanding that AI systems achieve through imitation alone. It calls for AI to develop a capacity for critical evaluation, allowing it to discern which actions are worth replicating and which should be discarded—an ability of which humans are capable but which cannot be easily taught or imitated by AI.

Inference, as Christian discusses in Chapter 8, represents a more sophisticated level of learning where AI systems attempt to understand and internalize the underlying intentions or values behind human actions. Christian gives a key example of this approach in his discussion of inverse reinforcement learning (IRL), which allows AI to deduce the rewards or goals implicit in observed behaviors. Such capabilities enable AI systems to perform complex tasks that require a deep understanding of human preferences and intentions, such as piloting helicopters or driving in unpredictable urban environments.

Christian acknowledges that ensuring a machine will infer human values accurately is fraught with difficulties. Human values are diverse, context-dependent, and often implicitly communicated. The challenge for AI, Christian argues, lies in interpreting these subtle cues without explicit instructions, necessitating advanced algorithms that can generalize from limited data while remaining flexible to new information. The development of cooperative inverse reinforcement learning (CIRL) marks a significant step toward this goal, promoting a collaborative model where AI not only learns from humans but also engages with them to refine its understanding and align its actions more closely with human objectives.

In Chapter 9, Christian provides the story of Stanislav Petrov as a reminder of the stakes involved when AI systems are tasked with critical decisions, addressing the theme of uncertainty, a pervasive issue in AI that affects both operational reliability and ethical decision-making. The ability of AI to handle uncertainty, not just in recognizing when it does not know something but also in making decisions when data is incomplete or ambiguous, underscores Christian’s exploration of the Ethical Implications of AI Use.

Christian argues that incorporating uncertainty into AI models, such as through Bayesian methods, allows systems to better represent what they do not know, potentially improving decision-making in critical applications like medicine and autonomous driving. However, dealing with ethical uncertainty involves not only technical solutions but also philosophical insight. AI systems must navigate complex moral landscapes where the right course of action is often unclear and the consequences of decisions can be significant and irreversible.

One of the ethical issues raised by the use of AI models in critical situations is the irreversibility of certain decisions. Researchers thus emphasize the need to define notions like irreversibility and high-impact decision-making in relation to specific goals. Such notions, while intuitive for humans, do not map neatly onto a machine’s representation of the world. Contemporary researchers therefore prioritize tractable goals that keep humans at the center of the decision-making process.

To address both practical and ethical issues with AI, Christian emphasizes the necessity of Interdisciplinary Approaches to AI Development and Implementation, integrating insights from computer science, cognitive science, ethics, and other fields. For Christian, this comprehensive approach remains essential for developing AI systems that are not only technically proficient but also socially and ethically responsible.
