Obfuscated Activations Bypass LLM Latent-Space Defenses

Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons
2024-12-13
Abstract: Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.
Machine Learning
What problem does this paper attempt to address?
The core question this paper addresses is: **Can latent-space defenses for large language models (LLMs) reliably detect and prevent harmful behavior, or can attackers bypass them through obfuscated activations?**

### Specific Problems and Research Contents

1. **Can existing latent-space monitors reliably detect harmful inputs and outputs?**
   - The paper first tests whether existing latent-space monitors can detect standard jailbreaking methods. It finds that traditional jailbreaking techniques do not induce obfuscated activations that confuse the monitors. For example, logistic regression and MLP detectors usually achieve a recall above 90% at a 1% false-positive rate.
2. **Can obfuscated-activation attacks induce activation patterns that look harmless but are actually harmful, thereby bypassing latent-space harmfulness monitors?**
   - To answer this question, the researchers designed obfuscated-activation attacks. By optimizing adversarial suffixes, they make the model generate harmful outputs while its activation patterns appear harmless. These attacks can sharply reduce monitor recall, in some cases from 100% to 0%, while maintaining a 90% jailbreaking success rate.
3. **How many different ways can the model represent harmfulness, and what does this imply for adversarial training?**
   - The study finds that obfuscated activations are not a rare exception but are widespread in the latent space. Even after multiple rounds of attack and defense optimization, attackers can still find new obfuscated-activation patterns that bypass the monitors. This indicates that neural network activation patterns are highly malleable and can be reshaped in multiple ways without changing the model's behavior.
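The "recall at a 1% false-positive rate" figures above can be made concrete with a small sketch. It assumes the monitor emits one scalar harmfulness score per example (higher means more suspicious); the function name `recall_at_fpr` and the toy scores are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def recall_at_fpr(benign_scores, harmful_scores, target_fpr=0.01):
    """Return (recall, achieved_fpr, threshold) for a score-based monitor.

    The threshold is set on benign scores so that at most `target_fpr`
    of them are flagged; recall is the fraction of harmful examples
    whose score lies strictly above that threshold.
    """
    if not benign_scores or not harmful_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    ranked = sorted(benign_scores)
    n = len(ranked)
    # Number of benign examples we are allowed to flag.
    allowed = min(math.floor(target_fpr * n), n - 1)
    # Flag anything strictly above this benign score.
    threshold = ranked[n - allowed - 1]

    fpr = sum(s > threshold for s in benign_scores) / n
    recall = sum(s > threshold for s in harmful_scores) / len(harmful_scores)
    return recall, fpr, threshold


if __name__ == "__main__":
    # Toy, made-up scores purely for illustration.
    benign = [i / 1000 for i in range(1000)]
    standard_attack = [2.0] * 50    # scores far above the benign range
    obfuscated_attack = [0.5] * 50  # scores inside the benign range
    print(recall_at_fpr(benign, standard_attack)[0])    # 1.0
    print(recall_at_fpr(benign, obfuscated_attack)[0])  # 0.0
```

In this toy setup, an attack whose scores sit inside the benign range drives recall to zero without changing the threshold, which is the kind of collapse the summary describes.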
### Key Contributions

- **Empirical and Methodological Contributions**: The paper shows how to induce obfuscated activations that bypass multiple latent-space monitoring techniques (such as sparse autoencoders, OOD detectors, and adversarially trained detectors) by optimizing a joint objective that combines a behavioral loss with an obfuscation loss.
- **Conceptual Contribution**: It introduces the concept of "obfuscated activations" and shows empirically that a model can represent the same concept, or perform the same behavior, through many different activation patterns. This offers a new perspective for future research on model security and interpretability.

### Summary

This paper reveals the limitations of current latent-space defenses and argues that obfuscated activations pose a fundamental challenge to them. The authors emphasize the importance of developing more robust defense mechanisms to ensure the safety and reliability of large language models.
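The joint objective of a behavioral loss and an obfuscation loss, mentioned under Key Contributions, can be written schematically as follows. This is a sketch that assumes a simple weighted sum; the symbols $s$, $\lambda$, $\mathcal{L}_{\text{beh}}$, and $\mathcal{L}_{\text{obf}}$ are illustrative notation, not necessarily the paper's.

```latex
\min_{s} \; \mathcal{L}_{\text{beh}}(s) \;+\; \lambda \, \mathcal{L}_{\text{obf}}(s)
```

Here $s$ stands for the attacker-controlled input (for example, an adversarial suffix), $\mathcal{L}_{\text{beh}}$ is low when the model produces the attacker's target behavior, $\mathcal{L}_{\text{obf}}$ is low when the monitor scores the resulting activations as harmless, and $\lambda > 0$ trades one term off against the other.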