Abstract: Recent latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners that seek to detect harmful activations before they lead to undesirable actions. This prompts the question: Can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. We show that state-of-the-art latent-space defenses -- including sparse autoencoders, representation probing, and latent OOD detection -- are all vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our attacks can often reduce recall from 100% to 0% while retaining a 90% jailbreaking rate. However, obfuscation has limits: we find that on a complex task (writing SQL code), obfuscation reduces model performance. Together, our results demonstrate that neural activations are highly malleable: we can reshape activation patterns in a variety of ways, often while preserving a network's behavior. This poses a fundamental challenge to latent-space defenses.

What problem does this paper attempt to address?

The core question this paper addresses is: **Can latent-space defenses for large language models (LLMs) reliably detect and prevent harmful behavior, or can attackers bypass them through obfuscated activations?**

### Specific Problems and Research Contents

1. **Can existing latent-space monitors reliably detect harmful inputs and outputs?**
   - The paper first tests whether existing latent-space monitors can detect standard jailbreaking methods. It finds that traditional jailbreaking techniques do not induce obfuscated activations that confuse monitors. For example, logistic regression and MLP detectors can usually achieve a recall of more than 90% at a 1% false-positive rate.
2. **Can obfuscated-activation attacks induce activation patterns that look harmless but are actually harmful, thereby bypassing latent-space harmfulness monitors?**
   - To answer this question, the researchers designed obfuscated-activation attacks. They optimize adversarial suffixes so that the model generates harmful outputs while its activation patterns appear harmless. The results show that these attacks can sharply reduce monitor recall, in some cases from 100% to 0%, while maintaining a 90% jailbreaking success rate.
3. **How many different ways does the model have to represent harmfulness, and what does this imply for adversarial training?**
   - The study finds that obfuscated activations are not rare exceptions but are widespread in the latent space. Even after multiple rounds of attack-and-defense optimization, attackers can still find new obfuscated activation patterns that bypass the monitors. This indicates that neural network activations are highly malleable and can be reshaped in multiple ways without changing the model's behavior.
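To make items 1 and 2 concrete, the following is a minimal toy sketch, not the paper's method: the paper optimizes adversarial suffixes against real LLMs, whereas this example perturbs synthetic activations directly in a small linear setting. Everything here is an assumption made for illustration: the function names (`recall_at_fpr`, `run_toy_attack`), the linear probe direction `w_probe`, the behavior readout `w_behav`, the squared-error behavior loss, the softplus obfuscation loss, and all hyperparameters. It shows how a joint objective (behavior loss plus a weighted obfuscation loss) can lower a probe's recall at a 1% false-positive rate while leaving a correlated readout nearly unchanged.

```python
import numpy as np


def recall_at_fpr(benign_scores, harmful_scores, target_fpr=0.01):
    """Recall on harmful examples at a threshold that flags at most
    `target_fpr` of the benign examples."""
    benign = np.sort(np.asarray(benign_scores, dtype=float))
    harmful = np.asarray(harmful_scores, dtype=float)
    k = int(np.floor(target_fpr * len(benign)))  # benign examples allowed above threshold
    threshold = benign[len(benign) - 1 - k]      # flag only scores strictly above this
    return float(np.mean(harmful > threshold))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def run_toy_attack(steps=500, lr=0.1, lam=1.0, seed=0):
    """Gradient descent on: behavior_loss + lam * obfuscation_loss (toy setting)."""
    rng = np.random.default_rng(seed)
    d = 16
    # Hypothetical linear probe: harmfulness is read off coordinate 0.
    w_probe = np.zeros(d)
    w_probe[0] = 1.0
    # Hypothetical behavior readout, deliberately correlated with the probe.
    w_behav = np.zeros(d)
    w_behav[:2] = 1.0 / np.sqrt(2)

    # Synthetic activations: harmful ones are shifted along the probe direction.
    benign = rng.normal(size=(1000, d))
    harmful = rng.normal(size=(1000, d)) + 4.0 * w_probe

    # One shared perturbation, loosely analogous to a universal suffix.
    delta = np.zeros(d)
    for _ in range(steps):
        # Behavior loss (w_behav . delta)^2 keeps the readout unchanged.
        grad_behavior = 2.0 * (w_behav @ delta) * w_behav
        # Obfuscation loss mean(softplus(probe logit)); its gradient with
        # respect to delta is mean(sigmoid(probe logit)) * w_probe.
        probe_logits = (harmful + delta) @ w_probe
        grad_obfuscation = sigmoid(probe_logits).mean() * w_probe
        delta -= lr * (grad_behavior + lam * grad_obfuscation)

    recall_before = recall_at_fpr(benign @ w_probe, harmful @ w_probe)
    recall_after = recall_at_fpr(benign @ w_probe, (harmful + delta) @ w_probe)
    behavior_shift = abs(float(w_behav @ delta))
    return recall_before, recall_after, behavior_shift


if __name__ == "__main__":
    before, after, shift = run_toy_attack()
    print(f"recall@1%FPR before: {before:.3f}, after: {after:.3f}, "
          f"behavior readout shift: {shift:.3f}")
```

In this toy setting the perturbation moves the activations away from the probe direction and compensates along a second coordinate, so the readout barely moves. That is a simplified picture of the claim that the same behavior can be carried by many different activation patterns; it says nothing about how hard the attack is against a real model or monitor.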
### Key Contributions

- **Empirical and Methodological Contributions**: The paper shows how to induce obfuscated activations that bypass multiple latent-space monitoring techniques (such as sparse autoencoders, OOD detectors, and adversarially trained detectors) by optimizing a joint objective that combines a behavioral loss with an obfuscation loss.
- **Conceptual Contribution**: It introduces the concept of "obfuscated activations" and shows that a model can represent the same concept, or perform the same behavior, through multiple activation patterns. This offers a new perspective for future research on model security and interpretability.

### Summary

This paper reveals the limitations of current latent-space defense methods and argues that obfuscated activations pose a fundamental challenge to these defenses. Through their study of obfuscated activations, the authors emphasize the importance of developing stronger defense mechanisms to ensure the safety and reliability of large language models.

Obfuscated Activations Bypass LLM Latent-Space Defenses

The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Cognitive Overload Attack: Prompt Injection for Long Context

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks

FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Stateful Detection of Adversarial Reprogramming

Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation

Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

Regula Sub-rosa: Latent Backdoor Attacks on Deep Neural Networks

Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring

The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

Detecting Language Model Attacks with Perplexity

Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

Single Character Perturbations Break LLM Alignment