Enhancing LLM Safety via Constrained Direct Preference Optimization

Zixuan Liu, Xiaolin Sun, Zizhan Zheng
2024-03-05
Abstract: The rapidly increasing capabilities of large language models (LLMs) raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to improve the safety of large language models (LLMs) through Constrained Direct Preference Optimization (C-DPO). As LLM capabilities rapidly improve, it becomes crucial to align these AI systems with diverse human preferences so that both their usefulness and their safety improve, even though these goals often conflict. One promising existing approach enforces a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework, but that approach is computationally expensive and often unstable.
C-DPO extends the recently proposed Direct Preference Optimization (DPO) method, a stable and lightweight alternative to RLHF that optimizes the policy directly on an offline preference dataset without reinforcement learning. By combining dual gradient descent with DPO, the method identifies a nearly optimal trade-off between helpfulness and harmlessness, provides a safety guarantee that plain DPO lacks, and achieves higher rewards than a recently proposed safe RLHF approach under the same safety constraint. The workflow transforms the original constrained optimization problem into its Lagrangian form and then solves it by alternately updating the policy and the Lagrange multiplier: the policy is optimized by maximum likelihood on a new preference dataset, and the Lagrange multiplier is updated by a gradient step driven by the expected violation of the safety constraint.
The experimental section evaluates how well C-DPO improves the safety and utility of LLMs. Compared with baseline methods such as Supervised Fine-Tuning (SFT), the original DPO, and the PPO-based Beaver-v1, C-DPO achieves higher rewards while maintaining safety.
In particular, C-DPO effectively balances helpfulness and harmlessness, thereby improving the model's performance while satisfying safety constraints.
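The alternating update described above can be illustrated with a minimal toy sketch of dual gradient descent. It is not the paper's implementation. It assumes a single prompt with four candidate responses whose rewards and costs are made-up numbers, and it replaces the DPO training step with the closed-form optimum of a KL-regularized objective over that finite set. All function and variable names (`regularized_policy`, `dual_gradient_descent`, `cost_limit`, and so on) are hypothetical.

```python
import math

def regularized_policy(ref_probs, rewards, costs, lam, beta):
    """Closed-form maximizer of E[r - lam*c] - beta*KL(pi || pi_ref) on a finite set.

    Assumption: this stands in for the policy update, which C-DPO performs
    with DPO on a preference dataset rather than in closed form.
    """
    logits = [math.log(p) + (r - lam * c) / beta
              for p, r, c in zip(ref_probs, rewards, costs)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected(policy, values):
    """Expectation of `values` under the discrete distribution `policy`."""
    return sum(p * v for p, v in zip(policy, values))

def dual_gradient_descent(ref_probs, rewards, costs, cost_limit,
                          beta=0.5, step=0.5, iters=500):
    """Alternate between a policy update and a Lagrange multiplier update."""
    lam = 0.0
    for _ in range(iters):
        # Policy step: best response to the combined signal r - lam * c.
        policy = regularized_policy(ref_probs, rewards, costs, lam, beta)
        # Multiplier step: raise lam when the expected cost exceeds the
        # limit, lower it otherwise, and keep it non-negative.
        violation = expected(policy, costs) - cost_limit
        lam = max(0.0, lam + step * violation)
    return lam, regularized_policy(ref_probs, rewards, costs, lam, beta)

# Hypothetical toy data: higher reward means more helpful, positive cost means unsafe.
REF_PROBS = [0.25, 0.25, 0.25, 0.25]
REWARDS = [1.0, 0.8, 0.5, 0.2]
COSTS = [1.0, 0.5, 0.0, -0.5]

if __name__ == "__main__":
    lam, policy = dual_gradient_descent(REF_PROBS, REWARDS, COSTS, cost_limit=0.0)
    print("multiplier:", lam)
    print("expected reward:", expected(policy, REWARDS))
    print("expected cost:", expected(policy, COSTS))
```

In this toy setting the unconstrained policy (multiplier fixed at zero) favors the most helpful but unsafe response, and the multiplier grows until the expected cost meets the limit. The closed-form policy step is a simplification chosen to keep the example self-contained.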