Abstract: The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to reviewing the mitigation methodologies and evaluation resources for these four categories, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and followed by a number of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety, with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academic researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers.

Supporting Human-AI Collaboration in Auditing LLMs with LLMs

Auditing large language models: a three-layered approach

LLMAuditor: A Framework for Auditing Large Language Models Using Human-in-the-Loop

She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and Sustainable Language Models

Output Scouting: Auditing Large Language Models for Catastrophic Responses

Developing a framework for auditing large language models using human-in-the-loop

Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards

An Auditing Test To Detect Behavioral Shift in Language Models

Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

Auditing the Use of Language Models to Guide Hiring Decisions

AuditLLM: A Tool for Auditing Large Language Models Using Multiprobe Approach

When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

Are You Human? An Adversarial Benchmark to Expose LLMs

Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming

Integrating Emotional and Linguistic Models for Ethical Compliance in Large Language Models

Informed AI Regulation: Comparing the Ethical Frameworks of Leading LLM Chatbots Using an Ethics-Based Audit to Assess Moral Reasoning and Normative Values

Can We Trust AI Agents? An Experimental Study Towards Trustworthy LLM-Based Multi-Agent Systems for AI Ethics

Large Language Model Safety: A Holistic Survey