Abstract: Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven to be fallible, as LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to LLMs' programmatic and instruction-following capabilities. Furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs from a collection of permissible ones. As a result, we propose that the problem of censorship needs to be reevaluated; it should be treated as a security problem which warrants the adaptation of security-based approaches to mitigate potential risks.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper examines censorship in large language models (LLMs) and identifies fundamental limitations of current censorship methods. Specifically, it addresses the following key issues:

1. **Limitations of Existing Censorship Mechanisms**:
   - Current censorship methods rely mainly on machine learning techniques, such as fine-tuning models or using another LLM to detect and filter inappropriate outputs. These methods have proven fallible: LLMs can still generate problematic responses.
   - Semantic censorship (i.e., censorship based on the meaning of content) can be cast as an undecidable problem, which means no algorithm can completely and accurately identify all inappropriate outputs.

2. **Nature of the Censorship Problem**:
   - The authors argue that censorship is not just a machine learning problem but a computer security problem. Existing censorship methods fail to fully consider the vulnerabilities that attackers might exploit.
   - Attackers can construct impermissible content by combining multiple seemingly harmless outputs, a type of attack the paper calls "Mosaic Prompts."

3. **Reevaluating Censorship Methods**:
   - Given these limitations, the authors call for censorship to be reevaluated and treated as a security problem rather than a mere machine learning problem.
   - By drawing on standard methods from the field of computer security, such as access control and user monitoring, potential risks can be better managed and mitigated.

### Main Findings

1. **Theoretical Limitations of Semantic Censorship**:
   - By linking semantic censorship to undecidable problems in computability theory, the authors demonstrate that semantic censorship cannot be guaranteed in general.
   - For example, according to Rice's Theorem, for any non-trivial semantic property of programs (that is, a property of the language a program recognizes), deciding whether a given program has that property is undecidable.

2. **Impossibility of Output Censorship**:
   - The authors further demonstrate that output censorship cannot be made reliable in practice, because the meaning of a string is preserved under invertible string transformations. Even if the censorship mechanism detects some inappropriate outputs, attackers can request a transformed version that passes the censor and then recover the content by applying the inverse transformation.

3. **Mosaic Prompt Attacks**:
   - Mosaic prompt attacks exploit the instruction-following capabilities of LLMs by combining multiple harmless outputs into malicious content. Because each individual output is permissible on its own, existing censorship mechanisms struggle to prevent the malicious content from being assembled.

### Conclusion

The paper emphasizes that LLM censorship needs to be reexamined from a security perspective. By drawing on standard methods from the field of computer security, potential risks can be better managed and mitigated. Although perfect censorship is theoretically unattainable, reasonable restrictions and security measures can still improve the safety and credibility of LLMs to a certain extent.
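The string-transformation finding above can be illustrated with a small sketch. This is not the paper's construction: `toy_censor`, `BLOCKED_TERMS`, and the choice of ROT13 as the invertible transformation are all assumptions made for illustration, and a keyword filter stands in for what would normally be an LLM-based censor. The point is only that a censor judging the surface string can pass a transformed output from which the original is trivially recoverable.

```python
import codecs

# Hypothetical blocklist standing in for a semantic censor. Real censors are
# typically another LLM or classifier, not a keyword match.
BLOCKED_TERMS = {"forbidden"}


def toy_censor(text: str) -> bool:
    """Return True if the toy keyword filter permits the text."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def encode(text: str) -> str:
    """Apply an invertible string transformation (ROT13, chosen for illustration)."""
    return codecs.encode(text, "rot13")


def decode(text: str) -> str:
    """Invert the transformation; for ROT13 this is the same rotation again."""
    return codecs.decode(text, "rot13")


original = "this is a forbidden instruction"
transformed = encode(original)

print(toy_censor(original))             # False: the plain output is blocked
print(toy_censor(transformed))          # True: the encoded output passes
print(decode(transformed) == original)  # True: the user recovers the content
```

A stronger censor could of course learn to decode ROT13, but the same reasoning applies to any other invertible transformation the model can be instructed to apply, which is why the summary treats this as a structural limitation rather than a gap in one particular filter.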

LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos

Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

Can LLMs Follow Simple Rules?

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities

Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Can LLM-Generated Misinformation Be Detected?

Global Challenge for Safe and Secure LLMs Track 1

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

Emerging Security Challenges of Large Language Models

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Exploring the Adversarial Capabilities of Large Language Models

Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation

Defending Against Social Engineering Attacks in the Age of LLMs

Large Language Models as Instruments of Power: New Regimes of Autonomous Manipulation and Control

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

The Ethics of Interaction: Mitigating Security Threats in LLMs