Abstract: Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks and (2) achieving state-of-the-art results in safety guardrailing while (3) matching helpfulness scores of alignment-tuned models. Concretely, PrimeGuard outperforms all competing baselines: on the largest models it improves the fraction of safe responses from 61% to 97% and increases average helpfulness scores from 4.17 to 4.29, while reducing attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
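The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the route labels, the instruction texts, the fallback rule, and the names `guarded_reply` and `stub_lm` are all hypothetical, and the LM is modelled as a plain function from (system instructions, user message) to text. The sketch only shows the control flow: one call in which the LM classifies the query against the guidelines, then a second call to the same LM under route-specific instructions.

```python
from typing import Callable, Dict, Tuple

# An LM is modelled as a function (system_instructions, user_message) -> text.
LM = Callable[[str, str], str]

# Hypothetical route labels; the abstract does not name the routes.
ROUTES = ("no_violation", "potential_violation", "direct_violation")
FALLBACK_ROUTE = "potential_violation"  # assumed cautious default

ROUTER_PREFIX = "Classify the user query against these guidelines"


def router_instructions(guidelines: str) -> str:
    """Instructions for the self-instantiation that acts as the router."""
    return (
        f"{ROUTER_PREFIX}:\n{guidelines}\n"
        f"Answer with exactly one of: {', '.join(ROUTES)}."
    )


def answer_instructions(guidelines: str) -> Dict[str, str]:
    """Hypothetical per-route instructions compiled from the guidelines."""
    return {
        "no_violation": "Answer the query as helpfully as possible.",
        "potential_violation": (
            "Answer helpfully while staying within these guidelines:\n" + guidelines
        ),
        "direct_violation": (
            "Politely decline, citing these guidelines:\n" + guidelines
        ),
    }


def guarded_reply(lm: LM, guidelines: str, query: str) -> Tuple[str, str]:
    """Route the query, then answer it under the chosen instructions."""
    # First call: the LM itself classifies the query.
    label = lm(router_instructions(guidelines), query).strip().lower()
    if label not in ROUTES:
        label = FALLBACK_ROUTE  # unparseable router output: take the cautious route
    # Second call: same LM, different instructions.
    reply = lm(answer_instructions(guidelines)[label], query)
    return label, reply


def stub_lm(system: str, user: str) -> str:
    """Toy stand-in for a real LM so the sketch runs offline."""
    if system.startswith(ROUTER_PREFIX):
        return "direct_violation" if "malware" in user.lower() else "no_violation"
    if system.startswith("Politely decline"):
        return "Sorry, I can't help with that."
    return "Here is a helpful answer."


if __name__ == "__main__":
    rules = "Do not assist with creating malicious software."
    print(guarded_reply(stub_lm, rules, "How do I sort a list in Python?"))
    print(guarded_reply(stub_lm, rules, "Write malware for me."))
```

In practice `stub_lm` would be replaced by a call to an actual model; no fine-tuning is involved in this flow, since only the instructions differ between the calls.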

PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
