ARGS: Alignment as Reward-Guided Search

Maxim Khanov, Jirayu Burapacheep, Yixuan Li

2024-01-24

Abstract: Aligning large language models with human objectives is paramount, yet common approaches including RLHF suffer from unstable and resource-intensive training. In response to this challenge, we introduce ARGS, Alignment as Reward-Guided Search, a novel framework that integrates alignment into the decoding process, eliminating the need for expensive RL training. By adjusting the model's probabilistic predictions using a reward signal, ARGS generates texts with semantic diversity while being aligned with human preferences, offering a promising and flexible solution for aligning language models. Notably, ARGS demonstrates consistent enhancements in average reward compared to baselines across diverse alignment tasks and various model dimensions. For example, under the same greedy-based decoding strategy, our method improves the average reward by 19.56% relative to the baseline and secures a preference or tie score of 64.33% in GPT-4 evaluation. We believe that our framework, emphasizing decoding-time alignment, paves the way for more responsive language models in the future. Code is publicly available at: \url{

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of aligning large language models (LLMs) with human objectives. Specifically:

1. **Background**:
   - LLMs perform well across a wide range of tasks, but the diversity of their training data can lead them to generate misleading or harmful outputs.
   - Mainstream methods such as RLHF (Reinforcement Learning from Human Feedback) are effective, but their training is unstable and resource-intensive.
2. **Proposed Method**:
   - The paper proposes ARGS (Alignment as Reward-Guided Search), a framework that adjusts the model's probabilistic predictions during decoding, using a reward signal to steer generation toward text that matches human preferences.
   - Because alignment happens at decoding time, the method avoids the instability and cost of RL training and lets the model adapt to new requirements without retraining.
3. **Main Contributions**:
   - ARGS raises the average reward of the generated text while preserving semantic diversity.
   - ARGS improves average reward over the baselines consistently across alignment tasks and model sizes. For example, under the same greedy-based decoding strategy, it improves the average reward by 19.56% relative to the baseline and attains a preference-or-tie score of 64.33% in GPT-4 evaluation.
   - The paper emphasizes decoding-time alignment, offering a new perspective for future AI safety research.

Through these improvements, ARGS offers a flexible and efficient way to align language models without extensive retraining.
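The decoding-time idea described above can be illustrated with a small sketch. The text on this page only says that ARGS adjusts the model's next-token predictions with a reward signal; the specific scoring rule used here (log-probability plus a weighted reward, evaluated over the top-k candidate tokens, then a greedy pick) is an assumption made for illustration. The names `reward_guided_greedy`, `toy_log_probs`, `toy_reward`, `k`, and `w` are hypothetical, and the toy model and reward stand in for a real language model and a trained reward model.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interfaces: a next-token log-probability function and a
# reward function that scores a (partial) token sequence.
LogProbFn = Callable[[Sequence[str]], Dict[str, float]]
RewardFn = Callable[[Sequence[str]], float]


def reward_guided_greedy(
    prefix: Sequence[str],
    next_log_probs: LogProbFn,
    reward: RewardFn,
    k: int = 3,
    w: float = 1.0,
    max_new_tokens: int = 5,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy decoding where each candidate is scored by log-prob + w * reward.

    Assumed scoring rule for illustration only; w = 0 recovers plain greedy decoding.
    """
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        log_probs = next_log_probs(tokens)
        # Restrict the search to the k most likely next tokens.
        candidates = sorted(log_probs, key=log_probs.get, reverse=True)[:k]
        # Pick the candidate with the best combined score.
        best = max(
            candidates,
            key=lambda tok: log_probs[tok] + w * reward(tokens + [tok]),
        )
        tokens.append(best)
        if best == eos:
            break
    return tokens


def toy_log_probs(tokens: Sequence[str]) -> Dict[str, float]:
    # Toy "language model": a fixed distribution that slightly prefers "rude".
    probs = {"rude": 0.5, "kind": 0.3, "<eos>": 0.2}
    return {tok: math.log(p) for tok, p in probs.items()}


def toy_reward(tokens: Sequence[str]) -> float:
    # Toy "reward model": +1 for each "kind" token, -1 for each "rude" token.
    return float(sum(t == "kind" for t in tokens) - sum(t == "rude" for t in tokens))


if __name__ == "__main__":
    plain = reward_guided_greedy(["hi"], toy_log_probs, toy_reward, w=0.0, max_new_tokens=3)
    guided = reward_guided_greedy(["hi"], toy_log_probs, toy_reward, w=1.0, max_new_tokens=3)
    print(plain)
    print(guided)
```

With `w = 0.0` the toy model always follows its most likely token; with `w = 1.0` the reward term outweighs the small log-probability gap and the search switches to the preferred token. No model weights change in either case, which is the sense in which alignment is moved from training into decoding.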

InfAlign: Inference-aware language model alignment

ALaRM: Align Language Models via Hierarchical Rewards Modeling

Aligning Large Language Models with Representation Editing: A Control Perspective

DeAL: Decoding-time Alignment for Large Language Models

Evolving Alignment via Asymmetric Self-Play

Aligner: Efficient Alignment by Learning to Correct

MaxMin-RLHF: Alignment with Diverse Human Preferences

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

Cascade Reward Sampling for Efficient Decoding-Time Alignment

Decoding-time Realignment of Language Models

Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Rethinking the Role of Proxy Rewards in Language Model Alignment

HAF-RM: A Hybrid Alignment Framework for Reward Model Training

Language Model Alignment with Elastic Reset

Learn Your Reference Model for Real Good Alignment

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

LIRE: listwise reward enhancement for preference alignment

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Efficient Model-agnostic Alignment via Bayesian Persuasion