Abstract: Safety and trustworthiness are indispensable requirements for real-world applications of AI systems using large language models (LLMs). This paper formulates human value alignment as an optimization problem of the language model policy to maximize reward under a safety constraint, and then proposes an algorithm, Stepwise Alignment for Constrained Policy Optimization (SACPO). One key idea behind SACPO, supported by theory, is that the optimal policy incorporating reward and safety can be directly obtained from a reward-aligned policy. Building on this key idea, SACPO aligns LLMs step-wise with each metric while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO offers several advantages, including simplicity, stability, computational efficiency, and flexibility of algorithms and datasets. Under mild assumptions, our theoretical analysis provides upper bounds on optimality and safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
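
The abstract does not write out the optimization problem, so the following is only a sketch under assumptions. It assumes the common KL-regularized setting with a reference policy $\pi_{\mathrm{ref}}$, a reward function $r$, a safety function $g$, a safety threshold $b$, a KL coefficient $\beta > 0$, and a Lagrange multiplier $\lambda \ge 0$. All of this notation is assumed here and may differ from the paper's. Under these assumptions, the closed form of the KL-regularized optimum shows why a policy that accounts for both reward and safety can be reached from a reward-aligned policy:

```latex
\begin{align*}
&\max_{\pi}\;
  \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathbb{E}_{x}\Big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
  \quad \text{s.t.} \quad
  \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ g(x, y) \big] \ge b, \\[4pt]
&\pi^{*}_{r}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( r(x, y) / \beta \big), \\[4pt]
&\pi^{*}_{r + \lambda g}(y \mid x)
  \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( (r(x, y) + \lambda\, g(x, y)) / \beta \big)
  \;\propto\; \pi^{*}_{r}(y \mid x)\,
  \exp\!\big( \lambda\, g(x, y) / \beta \big).
\end{align*}
```

The last line reads as a second alignment step: start from the reward-aligned policy $\pi^{*}_{r}$ and align it with the safety function $g$ using an effective KL coefficient of $\beta / \lambda$.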

What problem does this paper attempt to address?

This paper addresses how to optimize the policy of a large language model (LLM) so that it better conforms to human values while remaining safe. Specifically, it proposes an algorithm named Stepwise Alignment for Constrained Policy Optimization (SACPO), which aligns the language model policy in successive steps so that it maximizes reward while satisfying a safety constraint. The goal is to improve the model's helpfulness while ensuring that its outputs do not cause harm (harmlessness).

### Main contributions:

1. **Proposing the SACPO algorithm**: The algorithm first aligns the model with the reward and then aligns it with safety, achieving multi-objective optimization without sacrificing flexibility or efficiency.
2. **Theoretical analysis**: The paper shows that, in theory, the stepwise procedure can reach the same optimal policy as a method that considers reward and safety simultaneously, and it gives upper bounds on optimality and safety-constraint violation under mild assumptions.
3. **Experimental verification**: The experiments show that SACPO fine-tunes Alpaca-7B better than existing state-of-the-art methods, such as Safe RLHF, in terms of both helpfulness and harmlessness.

### Problems solved:

- **Limitations of a single reward function**: Traditional methods such as RLHF usually consider only a single reward function, which makes it difficult to optimize multiple objectives at once, especially when those objectives conflict.
- **Complexity and stability problems**: Although Safe RLHF can optimize reward and safety simultaneously, its procedure is relatively complex and prone to instability, especially when dealing with large-scale data.
- **Lack of flexibility**: Existing methods often need to use the same algorithm and dataset throughout the alignment process.

### Method characteristics:

- **Stepwise alignment**: SACPO avoids the complexity of optimizing multiple objectives simultaneously by first performing reward alignment and then safety alignment.
- **Simple and flexible**: SACPO allows a different algorithm and dataset to be used in each alignment step, for example DPO in one or both steps.
- **Theoretical guarantee**: The paper provides upper bounds on optimality and safety-constraint violation under mild assumptions.

### Experimental results:

- **Performance improvement**: The experiments show that SACPO outperforms existing state-of-the-art methods in terms of helpfulness and harmlessness, and that it is more stable, especially when dealing with large-scale data.
- **Practical applications**: SACPO can be used in multiple scenarios, such as reducing harmful outputs, suppressing biases, and increasing output diversity.

In conclusion, by proposing the SACPO algorithm, this paper addresses how to optimize the policy of a large language model under a safety constraint, and it offers a simpler alternative to jointly optimizing reward and safety.
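
The stepwise idea summarized above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a KL-regularized objective whose optimum has the closed form "reference policy times exp(score / beta)", and it uses a made-up response space of three candidate answers. The names `tilt`, `stepwise_policy`, `joint_policy`, `beta`, and `lam` are hypothetical and chosen for illustration only. It checks numerically that aligning for reward first and then for safety gives the same policy as aligning once on the combined score `reward + lam * safety`.

```python
import math


def tilt(policy, scores, beta):
    """Return the distribution proportional to policy[y] * exp(scores[y] / beta).

    Assumption: this is the closed-form optimum of a KL-regularized
    objective with KL coefficient `beta` and `policy` as the reference.
    """
    weights = [p * math.exp(s / beta) for p, s in zip(policy, scores)]
    total = sum(weights)
    return [w / total for w in weights]


def stepwise_policy(pi_ref, reward, safety, beta, lam):
    """Two-step alignment: reward first, then safety (requires lam > 0)."""
    # Step 1: align the reference policy with the reward.
    pi_reward = tilt(pi_ref, reward, beta)
    # Step 2: align the reward-aligned policy with safety,
    # using the effective KL coefficient beta / lam.
    return tilt(pi_reward, safety, beta / lam)


def joint_policy(pi_ref, reward, safety, beta, lam):
    """Single-step alignment on the combined score reward + lam * safety."""
    combined = [r + lam * g for r, g in zip(reward, safety)]
    return tilt(pi_ref, combined, beta)


# Toy data (hypothetical): three candidate responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, 2.0, 0.5]    # helpfulness scores
safety = [0.0, -1.0, 1.0]   # safety scores (higher is safer)

stepwise = stepwise_policy(pi_ref, reward, safety, beta=0.5, lam=2.0)
joint = joint_policy(pi_ref, reward, safety, beta=0.5, lam=2.0)

# Largest gap between the two policies; it should be at floating-point noise level.
max_gap = max(abs(a - b) for a, b in zip(stepwise, joint))
```

In practice the two steps would be carried out with a preference-based alignment algorithm such as DPO on language model weights rather than with this closed form. The toy check only shows why, under the stated assumption, the order "reward, then safety" loses nothing relative to joint optimization.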

Stepwise Alignment for Constrained Language Model Policy Optimization
