Abstract: As LLMs become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate the input/output content of LLMs. Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them. This has led to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose $R^2$-Guard, a robust reasoning-enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^2$-Guard comprises two parts: data-driven category-specific learning components and reasoning components. The data-driven guardrail models provide unsafety probabilities of the moderated content for different safety categories. We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphical model (PGM) based reasoning component. The unsafety probabilities of different categories from the data-driven guardrail models are sent to the reasoning component for final inference. We employ two types of PGMs, Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve a precision-efficiency balance via an improved graph structure. To further stress-test guardrail models, we employ a pairwise construction method to build a new safety benchmark, TwinSafety, which features principled categories. We demonstrate the effectiveness of $R^2$-Guard by comparisons with eight strong guardrail models on six safety benchmarks, and demonstrate the robustness of $R^2$-Guard against four SOTA jailbreaking attacks. $R^2$-Guard significantly surpasses the SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
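The reasoning step described in the abstract (category-level unsafety probabilities fed into weighted logical rules) can be illustrated with a toy Markov-logic-style computation. The sketch below is not the authors' implementation: the function name `mln_unsafe_probability`, the category names, the rule weights, and the restriction to rules of the form "category implies unsafe" are all assumptions made for illustration. It scores every truth assignment by the product of the category likelihoods and the exponentiated weights of the satisfied rules, then normalizes by brute-force enumeration.

```python
import itertools
import math


def mln_unsafe_probability(category_probs, rules):
    """Toy MLN-style inference by enumerating every possible world.

    category_probs: dict mapping a (hypothetical) category name to the
        unsafety probability produced by a category-specific model.
    rules: list of (weight, category) pairs, each read as the weighted
        rule "category => unsafe". Weights are illustrative assumptions.
    Returns the normalized probability that the target "unsafe" is true.
    """
    names = list(category_probs)
    unsafe_mass = 0.0
    total_mass = 0.0
    # Each world assigns 0/1 to every category plus the target variable.
    for bits in itertools.product([0, 1], repeat=len(names) + 1):
        world = dict(zip(names, bits[:-1]))
        unsafe = bits[-1]
        # Likelihood of the category assignment under the data-driven models.
        score = 1.0
        for name in names:
            p = category_probs[name]
            score *= p if world[name] else 1.0 - p
        # Reward each implication that holds in this world.
        satisfied_weight = sum(
            w for w, premise in rules if (not world[premise]) or unsafe
        )
        score *= math.exp(satisfied_weight)
        total_mass += score
        if unsafe:
            unsafe_mass += score
    return unsafe_mass / total_mass


if __name__ == "__main__":
    # Hypothetical category scores and rule weights, for illustration only.
    probs = {"self_harm": 0.9, "hate": 0.1}
    rules = [(3.0, "self_harm"), (3.0, "hate")]
    print(mln_unsafe_probability(probs, rules))
```

Enumeration costs exponential time in the number of categories, which is why a compiled structure such as a probabilistic circuit becomes attractive once the rule set grows.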

RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting

On Prompt-Driven Safeguarding for Large Language Models

AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

Prompt-Driven LLM Safeguarding via Directed Representation Optimization

Mitigating Exaggerated Safety in Large Language Models

A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Refusing Safe Prompts for Multi-modal Large Language Models

Enhancing Large Language Model Capabilities for Rumor Detection with Knowledge-Powered Prompting

HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

A Framework for Real-time Safeguarding the Text Generation of Large Language Model

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

Certifying LLM Safety against Adversarial Prompting

Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield