Abstract: Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open-source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at http://github.com/llm-attacks/llm-attacks.

AnomaLLMy -- Detecting anomalous tokens in black-box LLMs through low-confidence single-token predictions

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Anomaly Detection of Tabular Data Using LLMs

Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection

Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens

Where is the signal in tokenization space?

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Tokenization Falling Short: On Subword Robustness in Large Language Models

Large Language Model Tokenizer Bias: A Case Study and Solution on GPT-4o

Tokenizer Choice For LLM Training: Negligible or Crucial?

Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Universal and Transferable Adversarial Attacks on Aligned Language Models

ALPHA: AnomaLous Physiological Health Assessment Using Large Language Models

LogLLM: Log-based Anomaly Detection Using Large Language Models

Stylometric Watermarks for Large Language Models

DetoxBench: Benchmarking Large Language Models for Multitask Fraud & Abuse Detection

ADAGENT: Anomaly Detection Agent with Multimodal Large Models in Adverse Environments