

The Alignment Problem: Machine Learning and Human Values, by Brian Christian, was just released. This is an extended summary + opinion; a version without the quotes from the book will go out in the next Alignment Newsletter.

What do paper clips have to do with the end of the world? More than you might think, if you ask researchers trying to make sure that artificial intelligence acts in our interests. This goes back to 2003, when Nick Bostrom, a philosopher at the University of Oxford, posed a thought experiment: imagine a superintelligent AI that has been set the goal of producing as many paper clips as possible. Bostrom suggested it could quickly decide that killing all humans was pivotal to its mission, both because they might switch it off and because they are full of atoms that could be converted into more paper clips. The scenario is absurd, of course, but it illustrates a troubling problem: AIs don’t “think” like us and, if we aren’t extremely careful about spelling out what we want them to do, they can behave in unexpected and harmful ways. “The system will optimise what you actually specified, but not what you intended,” says Brian Christian, author of The Alignment Problem and a visiting scholar at the University of California, Berkeley. That problem boils down to the question of how to ensure AIs make decisions in line with human goals and values – whether you are worried about long-term existential risks, like the extinction of humanity, or immediate harms like AI-driven misinformation and bias. In any case, the challenges of AI alignment are significant, says Christian, due to the inherent difficulties involved in translating fuzzy human desires into the cold, numerical logic of computers.

This book starts off with an explanation of machine learning and the problems we can currently see with it, including detailed stories and analysis of:
- The neural net that thought asthma reduced the risk of pneumonia
- The COMPAS controversy (leading up to impossibility results in fairness)
- The failure of facial recognition models on minorities

It then moves on to agency and reinforcement learning, covering from a more historical and academic perspective how we have arrived at such ideas as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming and mesa optimization that we talk about frequently in this newsletter.
