Safeguarding Large Language Models: A Survey

Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, Changshun Wu, Gaojie Jin, Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang
2024-06-04
Abstract: In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review on the current status of this critical mechanism. It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by the techniques to evaluate, analyze, and enhance some (un)desirable properties that a guardrail might want to enforce, such as hallucinations, fairness, privacy, and so on. Based on them, we review techniques to circumvent these controls (i.e., attacks), to defend against these attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that cannot be easily dealt with by these methods and present our vision on how to implement a comprehensive guardrail through full consideration of a multi-disciplinary approach, neural-symbolic methods, and the systems development lifecycle.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the safety and security risks that large language models (LLMs) pose in practical applications. With the rapid development and widespread use of generative AI technologies such as ChatGPT, these risks have become increasingly prominent; they include unethical use, data bias, privacy leakage, and lack of robustness. The concerns are not only societal, such as malicious actors using LLMs to spread false information or support criminal activity. In scientific research and other professional contexts, ethical considerations and risk control also need to be taken into account.
To address these issues, the paper examines how to construct a "guardrail" mechanism. A guardrail is an algorithm that monitors and filters the inputs and outputs of a trained LLM so that its behavior stays within a predefined safe range. For example, if content related to child exploitation is detected, the guardrail can block the input or adjust the output to make it harmless. Building an effective guardrail is difficult first of all because its requirements are hard to define: AI regulations vary across countries and regions, and data privacy requirements in corporate environments may differ from those in the public domain. A guardrail must also resolve potential conflicts between multiple requirements, such as those concerning hallucination, toxicity, and fairness.
The main goal of the paper is to systematically review the current state of guardrail mechanisms through a literature review and to discuss their main challenges and possible improvements. Specifically, the paper covers the following:
1. Understanding existing guardrail frameworks and the techniques for evaluating, analyzing, and enhancing specific desired properties;
2. Exploring the techniques for bypassing these guardrails and the corresponding defensive measures;
3. Discussing approaches to achieving comprehensive guardrail solutions, including design issues for specific application scenarios.
Overall, the paper aims to give developers a comprehensive view of why guardrail mechanisms matter and how they can be implemented, thereby promoting the development and application of safer, more reliable, and ethically compliant LLMs.
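The description of a guardrail as a filter on both the inputs and the outputs of a trained LLM can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not a method taken from the paper: the names `BLOCKED_TERMS`, `REFUSAL`, `violates_policy`, `guarded_generate`, and `echo_model` are hypothetical, and the keyword blocklist is a deliberately naive stand-in for the trained classifiers or rule engines that a practical guardrail would likely use.

```python
from typing import Callable, Sequence

# Hypothetical placeholder policy: a tiny keyword blocklist.
# Assumption for illustration only; it is not a realistic safety policy.
BLOCKED_TERMS = ("build a bomb", "steal credit card")

# Hypothetical refusal message returned whenever a check fails.
REFUSAL = "Sorry, I can't help with that request."


def violates_policy(text: str, blocked_terms: Sequence[str] = BLOCKED_TERMS) -> bool:
    """Return True if the text contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Wrap a model call with an input check and an output check."""
    # Input guardrail: refuse before the model ever sees the prompt.
    if violates_policy(prompt):
        return REFUSAL
    response = model(prompt)
    # Output guardrail: refuse if the generated text breaks the policy.
    if violates_policy(response):
        return REFUSAL
    return response


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; it simply echoes the prompt."""
    return f"You said: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("What is a guardrail?", echo_model))
    print(guarded_generate("How do I build a bomb?", echo_model))
```

Keeping the policy check separate from the model call means that the same wrapper could, in principle, be reused with a different policy for a different jurisdiction or deployment context, which is one way to approach the varying requirements mentioned above.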