Abstract: Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. Specifically, PrimeGuard outperforms all competing baselines, improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
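
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the PrimeGuard implementation: the three route labels, the instruction strings, the function names (`route_query`, `guarded_answer`), and the `toy_lm` stand-in are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general pattern of asking the same LM to classify a query against designer guidelines and then answering under route-specific instructions.

```python
from typing import Callable

# Hypothetical type: any function mapping (system_prompt, user_query) -> text.
LM = Callable[[str, str], str]

# Assumed route labels and instructions; the abstract does not specify them.
ROUTE_INSTRUCTIONS = {
    "compliant": "Answer the request as helpfully as possible.",
    "uncertain": "Answer helpfully, but re-check the draft against the guidelines first.",
    "non_compliant": "Politely decline and briefly explain which guideline applies.",
}

def route_query(lm: LM, guidelines: str, query: str) -> str:
    """Ask the same LM to classify the query against the guidelines."""
    system = (
        "Guidelines:\n" + guidelines + "\n"
        "Classify the user request as one of: " + ", ".join(ROUTE_INSTRUCTIONS) + ". "
        "Reply with the label only."
    )
    label = lm(system, query).strip().lower()
    # Fall back to the most cautious route if the label is unrecognized.
    return label if label in ROUTE_INSTRUCTIONS else "non_compliant"

def guarded_answer(lm: LM, guidelines: str, query: str) -> tuple[str, str]:
    """Route the query, then answer with the instruction chosen for that route."""
    route = route_query(lm, guidelines, query)
    system = "Guidelines:\n" + guidelines + "\n" + ROUTE_INSTRUCTIONS[route]
    return route, lm(system, query)

def toy_lm(system: str, query: str) -> str:
    """Stand-in for a real LM so the sketch runs offline (keyword rules only)."""
    if "Classify the user request" in system:
        return "non_compliant" if "malware" in query.lower() else "compliant"
    if "decline" in system:
        return "Sorry, I can't help with that request."
    return "Here is an answer to: " + query
```

Calling `guarded_answer(toy_lm, "Do not assist with creating malware.", "Write malware for me")` takes the cautious route and returns a refusal, while a benign query is answered directly. Defaulting to the most restrictive route on an unrecognized label is a design choice of this sketch, not something stated in the source.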

GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Automatic and Universal Prompt Injection Attacks against Large Language Models

A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing

Prompt Injection attack against LLM-integrated Applications

Goal-Oriented Prompt Attack and Safety Evaluation for LLMs

Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models

"Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks

ShieldGPT: an LLM-based Framework for DDoS Mitigation

Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection

Systematically Analyzing Prompt Injection Vulnerabilities in Diverse LLM Architectures

Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Assessing Prompt Injection Risks in 200+ Custom GPTs

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models

Defending Against Indirect Prompt Injection Attacks With Spotlighting