Toward a More Expansive Perspective on AI Safety
The discipline of AI Safety has gained considerable momentum recently, buoyed by media attention and a glut of funding commitments from AI companies and governments. But how can we ensure that AI Safety delivers on its promises to reduce present and future harms from advanced AI systems?
The fall Workshop on Sociotechnical AI Safety at Stanford (co-hosted by Stanford’s McCoy Family Center for Ethics in Society, Stanford Institute for Human-Centered Artificial Intelligence (HAI), and the MINT lab at Australian National University) aimed to make progress on this question, bringing together a diverse group of researchers from industry and academia. The workshop put AI Safety researchers in conversation with researchers whose work focuses on fairness, accountability, transparency, and ethics (FATE) in AI.
Inclusion and AI Safety
A common theme in much of the discussion was inclusion. Shazeda Ahmed (UCLA) kicked off the workshop by characterizing the epistemic community of AI Safety, highlighting its close ties with effective altruist, longtermist, and rationalist movements. The ideological overlaps in the AI Safety community, Ahmed argued, have allowed for enviably effective field-building and dissemination of information.
The price of homogeneity, though, is the risk of excluding voices outside the “in-group” from conversations about AI Safety. This point was made by Dylan Hadfield-Menell (MIT), who showed that the very same formal arguments for taking AI Safety seriously (concerning proxy gaming in reinforcement learning) underscore the necessity of broadening participation in AI’s development and deployment. Marie-Therese Png (Oxford) employed ideas from critical security studies to argue that only a fully global approach to participation can make AI safer, drawing attention to the international supply chains that underlie much progress on the hardware side of AI’s development.
What could inclusion in AI Safety look like? Model fine-tuning provides one example. Deep Ganguli (Anthropic) presented work showing that a virtually crowdsourced “constitution” could be used to guide model fine-tuning, with modest improvements on downstream bias benchmarks relative to a researcher-specified constitution. Similarly, Nahema Marchal and Iason Gabriel (Google DeepMind) presented work using focus groups composed of members from marginalized communities within the U.S. to generate a rule set governing preferred model behavior, which could be incorporated into model fine-tuning or deployment. As Nathan Lambert (AI2) observed, though, there is still significant uncertainty about how we should interpret the results of contemporary tuning techniques. In particular, one might worry that inclusion at the stage of fine-tuning occurs too late and may not remove normatively salient biases acquired during model pre-training.
Complicating the Conceptual Landscape
A key term in discussions about AI Safety is “alignment.” It refers to the compatibility between human interests and the functioning of AI systems, which, in some scenarios, could pursue their own interests at our expense. However, as various speakers stressed, there is no consensus on the definition of the term itself or on the right path toward alignment. More broadly, Shiri Dori-Hacohen (UConn) argued that current research wrongly presupposes that there is a single set of human desires and needs to which AI can be said to be “misaligned.”
What other frameworks can we use to think about alignment? Mark Riedl (Georgia Tech) proposed the notion of “normative alignment” (in contrast to value alignment): alignment understood as conformity of AI systems to the norms of our communities. As the audience pointed out, a question remains: how do we make sure we incorporate the norms that serve the interests of minority groups? Who determines these norms?
The very idea of AI Safety requires us to identify the relevant risks, which is often a challenging task. A useful tool in this respect, Shalaleh Rismani (MILA) suggested, may be System-Theoretic Process Analysis (STPA), a hazard analysis framework that starts precisely from the identification of harms and losses and thereby helps to establish accountability methods and protocols. STPA is also particularly suited, Rismani noted, to account for the evolving capabilities of ML systems. In turn, Tegan Maharaj (Toronto) argued for using deep risk mapping, which combines agent-based modeling with deep learning to simulate possible futures. Maharaj emphasized that deep risk mapping identifies risks that are usually underestimated, such as compounded harms for marginalized groups or feedback loops in climate change.
Looking Forward
The richness and variety of voices in the workshop enabled the identification of pressing research topics that remain underexplored in the AI safety literature, such as issues around AI’s environmental costs or its democratization.
Irene Solaiman (Hugging Face) noted that environmental evaluations are often limited to the carbon emissions of training, testing, and deploying these systems, yet the energy costs of manufacturing hardware remain underexplored. Further, as Solaiman stressed, the environmental impact of manufacturing goes beyond the impact of carbon emissions and includes effects on natural resources like water.
Regarding the democratization of AI, community advocacy groups, which may bring significant knowledge to the table, are notably absent from much work on AI Safety. Questions remain, though, about the precise role they ought to play in AI’s development. Relatedly, what feedback mechanisms could keep people engaged with the set of rules that guide fine-tuning, in the way that citizens remain engaged with their countries’ constitutions via litigation or civil disobedience? As Rishi Bommasani (Stanford) suggested, ensuring greater transparency may be key here: in order to identify the most appropriate points for public engagement and intervention, we need to have greater insight into the full development pipeline.
In the words of co-organizer Seth Lazar (ANU), the workshop was proof positive that our best hope for setting normative goals for AI, like “safety,” is to integrate deep technical work with an equally robust understanding of how technology interacts with incentive structures and power relations in our societies. Lazar and other participants are already planning a series of related events aiming to foster and grow the multidisciplinary subfield of sociotechnical AI safety.
Want to know more details about this workshop? Check out the full report and workshop website.