Abstract: The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach.

What problem does this paper attempt to address?

This paper addresses how to improve the safety of large language models (LLMs) through Constrained Direct Preference Optimization (C-DPO). As LLM capabilities grow rapidly, it becomes crucial to align these systems with human preferences so that they are both useful and safe, even though the two goals often conflict. One promising way to do this is to enforce a safety constraint during fine-tuning through a constrained Reinforcement Learning from Human Feedback (RLHF) framework, but that approach is computationally expensive and often unstable.

C-DPO extends the recently proposed Direct Preference Optimization (DPO) method, a stable and lightweight alternative that optimizes the policy directly on an offline preference dataset without reinforcement learning. By combining dual gradient descent with DPO, C-DPO finds a near-optimal trade-off between helpfulness and harmlessness, provides a safety guarantee that plain DPO lacks, and achieves higher rewards than a recently proposed safe RLHF approach under the same safety constraint.

The paper describes the C-DPO workflow: the original constrained optimization problem is transformed into its Lagrangian form and then solved by iteratively updating the policy and the Lagrange multipliers. The policy is optimized through maximum likelihood on a new preference dataset, and the Lagrange multipliers are updated through gradient descent to reduce the expected violation of the safety constraints.

The experimental section evaluates C-DPO on both the safety and the utility of LLMs. Compared to baselines such as Supervised Fine-Tuning (SFT), the original DPO, and the PPO-based Beaver-v1, C-DPO achieves higher rewards while satisfying the safety constraint.
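
The alternating scheme in the summary (update the policy for a fixed Lagrange multiplier, then adjust the multiplier according to the constraint violation) can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a four-action bandit in place of an LLM, made-up reward and cost vectors, and a KL-regularized objective whose optimal policy has a closed form, so the inner maximum-likelihood DPO step is replaced by that closed form. All names (`optimal_policy`, `dual_gradient_descent`, `cost_limit`) are hypothetical.

```python
import numpy as np

def optimal_policy(pi_ref, reward, cost, lam, beta):
    """Closed-form maximizer of E[reward - lam * cost] - beta * KL(pi || pi_ref)."""
    logits = np.log(pi_ref) + (reward - lam * cost) / beta
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def dual_gradient_descent(pi_ref, reward, cost, cost_limit,
                          beta=0.1, lr=0.5, steps=500):
    """Alternate a policy update with a projected update of the multiplier."""
    lam = 0.0
    for _ in range(steps):
        # Policy step: best response to the current multiplier.
        pi = optimal_policy(pi_ref, reward, cost, lam, beta)
        # Dual step: raise lam when the expected cost exceeds the limit,
        # lower it otherwise, and keep it non-negative.
        violation = float(pi @ cost) - cost_limit
        lam = max(0.0, lam + lr * violation)
    return optimal_policy(pi_ref, reward, cost, lam, beta), lam

# Toy data (assumed for illustration): higher-reward actions are also costlier.
pi_ref = np.array([0.25, 0.25, 0.25, 0.25])
reward = np.array([1.0, 0.8, 0.5, 0.1])
cost = np.array([1.0, 0.6, 0.2, 0.0])

pi, lam = dual_gradient_descent(pi_ref, reward, cost, cost_limit=0.3)
print("policy:", np.round(pi, 3))
print("multiplier:", round(lam, 3))
print("expected cost:", round(float(pi @ cost), 3))
```

With a tight limit the multiplier settles at a positive value and the expected cost meets the limit. With a loose limit (for example `cost_limit=1.0`) the multiplier stays at zero and the policy reduces to the unconstrained one.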

Enhancing LLM Safety via Constrained Direct Preference Optimization

$α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Minor DPO reject penalty to increase training robustness

Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

Uncertainty-Penalized Direct Preference Optimization

On the Generalization of Preference Learning with DPO

Prompt-Driven LLM Safeguarding via Directed Representation Optimization

Direct Preference Optimization Using Sparse Feature-Level Constraints

One-Shot Safety Alignment for Large Language Models via Optimal Dualization

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

CCPO: Conservatively Constrained Policy Optimization Using State Augmentation

New Desiderata for Direct Preference Optimization

MallowsPO: Fine-Tune Your LLM with Preference Dispersions

RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models

Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective