Abstract:Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at:

What problem does this paper attempt to address?

The paper primarily explores the security issues of Large Language Models (LLMs) in conversational applications, with a particular focus on how to prevent these models from generating harmful information. As LLMs demonstrate powerful capabilities in various conversational contexts, the risk of their misuse in producing harmful responses has also raised significant societal concern. The paper points out that LLMs could be used for fraudulent activities, cyber-attacks, spreading toxic content, perpetuating discriminatory biases, and disseminating false information. To address these issues, the paper provides a comprehensive review of LLM conversational security, covering three key aspects: attacks, defenses, and evaluations. In terms of attacks, it studies two main types of methods: inference-time attacks and training-time attacks. Inference-time attacks induce the model to produce unsafe responses by constructing adversarial prompts, while training-time attacks compromise the model's security by modifying model weights or injecting malicious data. Regarding defense strategies, the paper discusses safe alignment, inference guidance, and filtering methods. Safe alignment aims to enhance the intrinsic safety capabilities of pre-trained models through fine-tuning; inference guidance uses techniques such as system prompts to further improve the model's safety performance; input/output filters are used to detect and block malicious inputs or outputs. Finally, in the evaluation section, the paper introduces a range of datasets and metrics for measuring the safety of LLMs, including attack success rates and other fine-grained metrics. The goal of the paper is to provide a structured summary to deepen the understanding of LLM conversational safety and to encourage further research in the field. In summary, this paper conducts an in-depth analysis of LLM conversational safety, aiming to promote research and development in the field and ensure that these powerful language models can be more secure and reliable in practical applications.

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Security and Privacy Challenges of Large Language Models: A Survey

Exploring Advanced Methodologies in Security Evaluation for LLMs

Exploring Vulnerabilities and Protections in Large Language Models: A Survey

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue

On Protecting the Data Privacy of Large Language Models (LLMs): A Survey

Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models

Conversational Complexity for Assessing Risk in Large Language Models

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Evaluating Large Language Models: A Comprehensive Survey