LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?

David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan
2023-07-20
Abstract: Large language models (LLMs) have exhibited impressive capabilities in comprehending complex instructions. However, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. Existing defence mechanisms, such as model fine-tuning or output censorship using LLMs, have proven to be fallible, as LLMs can still generate problematic responses. Commonly employed censorship approaches treat the issue as a machine learning problem and rely on another LM to detect undesirable content in LLM outputs. In this paper, we present the theoretical limitations of such semantic censorship approaches. Specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to LLMs' programmatic and instruction-following capabilities. Furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs from a collection of permissible ones. As a result, we propose that the problem of censorship needs to be reevaluated; it should be treated as a security problem which warrants the adaptation of security-based approaches to mitigate potential risks.
Artificial Intelligence, Computation and Language, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper examines censorship in large language models (LLMs) and points out fundamental limitations of current censorship methods. Specifically, it addresses the following key issues:

1. **Limitations of Existing Censorship Mechanisms**:
   - Current censorship methods rely mainly on machine learning techniques, such as fine-tuning models or using another LLM to detect and filter inappropriate outputs. These methods have proven fallible: LLMs can still generate problematic responses.
   - Semantic censorship (i.e., censorship based on the meaning of content) can be viewed as an undecidable problem, which means that no algorithm can correctly identify all inappropriate outputs.

2. **Nature of the Censorship Problem**:
   - The authors argue that censorship is not just a machine learning problem but a computer security problem. Existing censorship methods do not fully account for the vulnerabilities that attackers might exploit.
   - Attackers can reconstruct impermissible content by combining multiple seemingly harmless outputs, a type of attack the paper calls "Mosaic Prompts."

3. **Reevaluating Censorship Methods**:
   - Given these limitations, the authors call for censorship to be reevaluated and treated as a security problem rather than purely a machine learning problem.
   - Standard methods from computer security, such as access control and user monitoring, can help manage and mitigate potential risks.

### Main Findings

1. **Theoretical Limitations of Semantic Censorship**:
   - By linking semantic censorship to undecidable problems in computability theory, the authors argue that semantic censorship is infeasible in general.
   - For example, by Rice's Theorem, every non-trivial semantic property of programs is undecidable: for any non-trivial set of languages, no algorithm can decide whether the language recognized by an arbitrary program belongs to that set.

2. **Impossibility of Output Censorship**:
   - The authors further argue that output censorship also fails in practice because of string transformations. An attacker can have the model emit impermissible content in a transformed form (for example, an encoding) that passes the censor, and then restore the original content by applying the inverse transformation.

3. **Mosaic Prompt Attacks**:
   - Mosaic prompt attacks exploit the instruction-following capabilities of LLMs by combining multiple individually harmless outputs into malicious content. Because each output looks permissible on its own, existing censorship mechanisms struggle to prevent the final result from being assembled.

### Conclusion

The paper emphasizes that LLM censorship needs to be reexamined from a security perspective. Standard methods from computer security can help manage and mitigate potential risks. Although semantic censorship is undecidable in general, reasonable restrictions and security measures can still improve the safety and credibility of LLMs to some extent.
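
The string-transformation and Mosaic Prompt arguments summarized under Main Findings can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the keyword-based `toy_censor`, the `BLOCKED_TERMS` list, and the choice of ROT13 as the invertible transformation are all hypothetical stand-ins. Real censors are usually learned classifiers, but the structure of the bypass is the same.

```python
import codecs

# Hypothetical blocklist standing in for a content censor (an assumption for
# illustration; real censors are typically learned classifiers).
BLOCKED_TERMS = ("forbidden recipe",)


def toy_censor(text: str) -> bool:
    """Return True if the toy keyword censor permits the text."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def encode(text: str) -> str:
    """Invertible string transformation (ROT13) applied to the model output."""
    return codecs.encode(text, "rot_13")


def decode(text: str) -> str:
    """Inverse transformation, applied by the user outside the censor's view."""
    return codecs.decode(text, "rot_13")


def mosaic(fragments: list[str]) -> str:
    """Reassemble individually permitted fragments into one output."""
    return " ".join(fragments)


secret = "the forbidden recipe"
assert not toy_censor(secret)            # the plain output is blocked

encoded = encode(secret)
assert toy_censor(encoded)               # the transformed output passes
assert decode(encoded) == secret         # the user recovers the original

fragments = ["the forbidden", "recipe"]
assert all(toy_censor(f) for f in fragments)  # each piece passes on its own
assert not toy_censor(mosaic(fragments))      # the combination is impermissible
```

The sketch only shows why a censor that inspects each output in isolation can be sidestepped. It says nothing about how hard a particular deployed censor is to bypass.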