Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

Zhichen Dong,Zhanhui Zhou,Chao Yang,Jing Shao,Yu Qiao
2024-03-27
Abstract:Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at:
Computation and Language,Artificial Intelligence,Computers and Society,Machine Learning
What problem does this paper attempt to address?
The paper primarily explores the security issues of Large Language Models (LLMs) in conversational applications, with a particular focus on how to prevent these models from generating harmful information. As LLMs demonstrate powerful capabilities in various conversational contexts, the risk of their misuse in producing harmful responses has also raised significant societal concern. The paper points out that LLMs could be used for fraudulent activities, cyber-attacks, spreading toxic content, perpetuating discriminatory biases, and disseminating false information. To address these issues, the paper provides a comprehensive review of LLM conversational security, covering three key aspects: attacks, defenses, and evaluations. In terms of attacks, it studies two main types of methods: inference-time attacks and training-time attacks. Inference-time attacks induce the model to produce unsafe responses by constructing adversarial prompts, while training-time attacks compromise the model's security by modifying model weights or injecting malicious data. Regarding defense strategies, the paper discusses safe alignment, inference guidance, and filtering methods. Safe alignment aims to enhance the intrinsic safety capabilities of pre-trained models through fine-tuning; inference guidance uses techniques such as system prompts to further improve the model's safety performance; input/output filters are used to detect and block malicious inputs or outputs. Finally, in the evaluation section, the paper introduces a range of datasets and metrics for measuring the safety of LLMs, including attack success rates and other fine-grained metrics. The goal of the paper is to provide a structured summary to deepen the understanding of LLM conversational safety and to encourage further research in the field. In summary, this paper conducts an in-depth analysis of LLM conversational safety, aiming to promote research and development in the field and ensure that these powerful language models can be more secure and reliable in practical applications.