Mapping Technical Safety Research at AI Companies: A literature review and incentives analysis

Oscar Delaney, Oliver Guest, Zoe Williams
2024-09-26
Abstract: As AI systems become more advanced, concerns about large-scale risks from misuse or accidents have grown. This report analyzes the technical research into safe AI development being conducted by three leading AI companies: Anthropic, Google DeepMind, and OpenAI. We define safe AI development as developing AI systems that are unlikely to pose large-scale misuse or accident risks. This encompasses a range of technical approaches aimed at ensuring AI systems behave as intended and do not cause unintended harm, even as they are made more capable and autonomous. We analyzed all papers published by the three companies from January 2022 to July 2024 that were relevant to safe AI development, and categorized the 80 included papers into nine safety approaches. Additionally, we noted two categories representing nascent approaches explored by academia and civil society, but not currently represented in any research papers by these leading AI companies. Our analysis reveals where corporate attention is concentrated and where potential gaps lie. Some AI research may stay unpublished for good reasons, such as to not inform adversaries about the details of security techniques they would need to overcome to misuse AI systems. Therefore, we also considered the incentives that AI companies have to research each approach, regardless of how much work they have published on the topic. We identified three categories where there are currently no or few papers and where we do not expect AI companies to become much more incentivized to pursue this research in the future. These are model organisms of misalignment, multi-agent safety, and safety by design. Our findings provide an indication that these approaches may be slow to progress without funding or efforts from government, civil society, philanthropists, or academia.
Computers and Society
What problem does this paper attempt to address?
The paper addresses the following question: as AI systems become more capable, how can developers ensure that these systems do not cause large-scale harm through misuse or accidents? Specifically, it analyzes the technical research on safe AI development at three leading AI companies: Anthropic, Google DeepMind, and OpenAI. "Safe AI development" is defined as developing AI systems that are unlikely to pose large-scale misuse or accident risks. To this end, the paper:

1. **Identifies and classifies approaches to safe AI development**: It sorts the relevant research into nine safety approaches and notes two nascent approaches that are explored by academia and civil society but not yet represented in any research papers from these companies.
2. **Analyzes published research papers**: It collects and analyzes the 80 papers relevant to safe AI development that the three companies published from January 2022 to July 2024.
3. **Evaluates companies' incentives to pursue each approach**: It considers factors such as reputational effects, regulatory burden, and whether an approach makes the company's AI systems more useful, in order to anticipate future research trends.

This analysis shows where corporate attention is concentrated and where potential gaps lie. Three areas stand out: model organisms of misalignment, multi-agent safety, and safety by design. These areas have few or no papers, and companies currently have little incentive to pursue them.

### Main Findings

- **Enhanced Human Feedback** (39%): Improve methods of incorporating human preferences into AI training, especially in cases where people have difficulty providing sufficient feedback on AI outputs.
- **Mechanistic Interpretability** (24%): Develop tools to convert model weights into high-level human concepts that describe the model's beliefs and reasoning processes.
- **Robustness** (13%): Improve the worst-case performance of AI systems under unusual inputs and reduce the likelihood of unpredictable behavior.
- **Safety Assessment** (11%): Evaluate whether AI systems have dangerous capabilities, in order to decide whether mitigations are needed or whether training should continue.
- **Power-Seeking Tendencies** (4%): Understand whether AI systems tend to seek power and study methods to suppress such tendencies.
- **Honest AI** (4%): Ensure that AI systems accurately convey their beliefs and reasoning processes.
- **Safety by Design** (3%): Explore new methods for constructing inherently safe AI systems.
- **Unlearning** (3%): Intentionally make the model less capable on certain dangerous tasks.
- **Model Organisms of Misalignment** (1%): Create simple demonstrations of AI deception or other concerning behaviors, and use them to test whether proposed safety techniques are effective.
- **Multi-Agent Safety** (0%): Understand and mitigate the risks that arise from interactions between AI systems.
- **Controlling Untrusted AI** (0%): Develop techniques that keep models safe even when they are misaligned.

### Conclusion

The paper finds that model organisms of misalignment, multi-agent safety, and safety by design currently have few or no papers, and the authors do not expect AI companies to become much more incentivized to pursue them. Progress in these areas may therefore be slow without funding or effort from government, civil society, philanthropists, or academia.
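
The category percentages listed under Main Findings can be tabulated with a short script. The following is a minimal sketch, not the authors' method: the dictionary simply copies the reported figures, and the names `REPORTED_SHARES`, `rank_shares`, and `unrepresented` are hypothetical helpers introduced for illustration. The reported figures add up to 102 rather than 100. This may reflect rounding to whole percentages or papers counted under more than one approach, but the summary does not say which.

```python
# Reported percentage per safety approach, copied from the Main Findings list.
# The figures are taken as given; no underlying paper counts are assumed.
REPORTED_SHARES = {
    "Enhanced human feedback": 39,
    "Mechanistic interpretability": 24,
    "Robustness": 13,
    "Safety assessment": 11,
    "Power-seeking tendencies": 4,
    "Honest AI": 4,
    "Safety by design": 3,
    "Unlearning": 3,
    "Model organisms of misalignment": 1,
    "Multi-agent safety": 0,
    "Controlling untrusted AI": 0,
}


def rank_shares(shares):
    """Return (category, share) pairs sorted from largest to smallest share."""
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


def unrepresented(shares):
    """Return the categories with a reported share of zero, in listed order."""
    return [name for name, share in shares.items() if share == 0]


if __name__ == "__main__":
    for name, share in rank_shares(REPORTED_SHARES):
        print(f"{share:>3}%  {name}")
    print("Sum of reported shares:", sum(REPORTED_SHARES.values()))
    print("Approaches with no papers:", unrepresented(REPORTED_SHARES))
```

Ranking by share alone does not reproduce the paper's conclusion. The three areas it highlights were chosen by combining low publication counts with weak expected incentives, which this tally does not model.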