Abstract: In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review of the current status of this critical mechanism. It discusses the major challenges and how the mechanism can be enhanced into a comprehensive one that deals with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by the techniques to evaluate, analyze, and enhance some (un)desirable properties that a guardrail might want to enforce, such as hallucinations, fairness, and privacy. Building on these, we review techniques to circumvent these controls (i.e., attacks), to defend against the attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that these methods cannot easily handle, and present our vision of how to implement a comprehensive guardrail through full consideration of a multi-disciplinary approach, neural-symbolic methods, and the systems development lifecycle.

What problem does this paper attempt to address?

The paper aims to address the safety issues that large language models (LLMs) face in practical applications. With the rapid development and widespread use of generative AI technologies (such as ChatGPT), the risks associated with LLMs have become increasingly prominent, including but not limited to ethical use, data bias, privacy leakage, and robustness. These risks are not limited to societal concerns, such as malicious actors using LLMs to spread false information or engage in criminal activities; in scientific research and other professional contexts, ethical considerations and risk control also need to be taken into account.

To address these issues, the paper examines how to construct a "guardrail" mechanism. A "guardrail" is an algorithm that monitors and filters the inputs and outputs of a trained LLM to ensure its behavior stays within a predefined safe range. For example, if content related to child exploitation is detected, the "guardrail" can block the input or adjust the output to make it harmless.

Constructing effective "guardrails" is difficult, however, because their requirements are hard to define: AI regulations vary across countries and regions, and data privacy requirements in corporate environments may differ from those in the public domain. In addition, "guardrails" need to handle potential conflicts between multiple requirements, such as those concerning hallucinations, toxicity, and fairness.

The main goal of the paper is to systematically review the current state of "guardrail" mechanisms through a literature review and to discuss their main challenges and how they can be improved. Specifically, the paper covers the following content:

1. Understanding existing "guardrail" frameworks and the techniques for evaluating, analyzing, and enhancing specific desired attributes;
2. Exploring the techniques for bypassing these "guardrails" and the corresponding defensive measures;
3. Discussing the approaches to achieving comprehensive "guardrail" solutions, including design issues for specific application scenarios.

Overall, this paper aims to give developers a comprehensive perspective on the importance of "guardrail" mechanisms and their implementation details, thereby promoting the application and development of safer, more reliable, and ethically compliant LLMs.
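
The idea of a guardrail as a filter on both the inputs and the outputs of a trained LLM can be illustrated with a minimal sketch. It is not the paper's method: the names `BLOCKED_TERMS`, `violates_policy`, `guarded_generate`, and `echo_model` are hypothetical, and the keyword match only stands in for the trained classifiers or policy rules a real guardrail would likely use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical policy: a toy blocklist standing in for a real safety classifier.
BLOCKED_TERMS = ("build a bomb", "steal credentials")
REFUSAL = "Sorry, I can't help with that request."


@dataclass
class GuardrailResult:
    allowed: bool      # True if both the prompt and the response passed the checks
    text: str          # the model output, or a refusal message
    reason: str = ""   # which check blocked the exchange, if any


def violates_policy(text: str, blocked_terms: Iterable[str] = BLOCKED_TERMS) -> bool:
    """Return True if the text contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> GuardrailResult:
    """Wrap a model call with an input check and an output check."""
    # Input guardrail: block unsafe prompts before they reach the model.
    if violates_policy(prompt):
        return GuardrailResult(False, REFUSAL, "input blocked")

    output = model(prompt)

    # Output guardrail: replace unsafe generations with a harmless refusal.
    if violates_policy(output):
        return GuardrailResult(False, REFUSAL, "output blocked")

    return GuardrailResult(True, output)


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption: any str -> str callable works)."""
    return f"You asked: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("What is a guardrail?", echo_model))
    print(guarded_generate("How do I build a bomb?", echo_model))
```

Keeping the input and output checks separate mirrors the description above: the same policy is applied on both sides of the model, so either check could be swapped for a stronger detector without changing the wrapper.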

Safeguarding Large Language Models: A Survey

Building Guardrails for Large Language Models

Current state of LLM Risks and AI Guardrails

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A Survey

Large Language Model Supply Chain: Open Problems From the Security Perspective

Challenges in Guardrailing Large Language Models for Science

A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation

Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine

Evaluating Large Language Models: A Comprehensive Survey

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

On Protecting the Data Privacy of Large Language Models (LLMs): A Survey

Recent Advances in Attack and Defense Approaches of Large Language Models

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Exploring Vulnerabilities and Protections in Large Language Models: A Survey

Exploring Vulnerabilities and Threats in Large Language Models: Safeguarding Against Exploitation and Misuse

Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks

Exploring Advanced Methodologies in Security Evaluation for LLMs

Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming