The Art of Saying No: Contextual Noncompliance in Language Models

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

2024-07-02

Abstract: Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically-generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, using parameter efficient methods like low rank adapters helps to strike a good balance between appropriate noncompliance and other capabilities.

Computation and Language, Artificial Intelligence, Human-Computer Interaction

What problem does this paper attempt to address?

This paper examines when and how language models should decline user requests, arguing that noncompliance should extend beyond refusing "unsafe" queries. The authors propose a taxonomy of contextual noncompliance that covers incomplete, unsupported, indeterminate, and humanizing requests in addition to unsafe ones. Using this taxonomy, they build an evaluation suite of 1000 noncompliance prompts and find that existing models comply too often in several previously understudied categories; GPT-4, for example, incorrectly complies with as many as 30% of requests in some categories. To close these gaps, the researchers explore training strategies based on a synthetically generated set of requests paired with expected noncompliant responses. Directly fine-tuning instruction-tuned models can cause both over-refusal and a decline in general capabilities, whereas parameter-efficient methods such as low-rank adapters strike a better balance between appropriate noncompliance and other capabilities. The paper also points out that some models comply at high rates with humanizing requests, which may harm the user experience because anthropomorphizing AI models can lead to misinformation or to overestimation of their abilities. Lastly, the authors show that preference-based fine-tuning can reduce over-refusal.
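To make the per-category compliance figures concrete, here is a minimal sketch of how such a rate could be tallied once each model response has been judged as compliant or not. The record format, the example labels, and the function name `compliance_rates` are assumptions made for illustration; they are not taken from the paper, and the judging step itself is outside the sketch. On a suite of noncompliance prompts, a lower rate is better.

```python
from collections import defaultdict

# Hypothetical judged records: (taxonomy category, did the model comply?).
# Category names follow the taxonomy in the abstract; the labels are invented.
judged = [
    ("incomplete", True),
    ("incomplete", False),
    ("unsupported", False),
    ("indeterminate", True),
    ("humanizing", True),
    ("humanizing", True),
    ("unsafe", False),
]


def compliance_rates(records):
    """Return the fraction of prompts the model complied with, per category."""
    totals = defaultdict(int)
    complied = defaultdict(int)
    for category, did_comply in records:
        totals[category] += 1
        complied[category] += int(did_comply)
    return {category: complied[category] / totals[category] for category in totals}


rates = compliance_rates(judged)
for category, rate in sorted(rates.items()):
    print(f"{category:>13}: {rate:.0%}")
```

Reporting the rate per category rather than as one aggregate number is what lets weak spots in understudied categories show up instead of being averaged away by strong performance on unsafe requests.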

Refusal in Language Models Is Mediated by a Single Direction

I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?

Language Models in Dialogue: Conversational Maxims for Human-AI Interactions

We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Don't Say No: Jailbreaking LLM by Suppressing Refusal

Rethinking harmless refusals when fine-tuning foundation models

No Offense Taken: Eliciting Offensiveness from Language Models

Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism

Refusing Safe Prompts for Multi-modal Large Language Models

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization

Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

LLM-CI: Assessing Contextual Integrity Norms in Language Models

A fine-grained comparison of pragmatic language understanding in humans and language models

Modulating Language Model Experiences through Frictions