Abstract: Critique ability, a meta-cognitive capability in humans, is difficult to improve in LLMs. Recent work relies primarily on supervised fine-tuning (SFT) with critiques generated by a single LLM such as GPT-4. However, these model-generated critiques are often flawed because critiquing is inherently complex. Fine-tuning LLMs on such flawed critiques therefore limits the model's performance and propagates the flaws into the learned model. To overcome these challenges, this paper proposes a novel data generation pipeline, named MultiCritique, that improves the critique ability of LLMs by using multi-agent feedback in both the SFT and reinforcement learning (RL) stages. First, the pipeline aggregates high-quality critiques from multiple agents instead of a single model, and supplies crucial information as input to simplify the critique task. Second, the pipeline uses multi-agent feedback to improve the accuracy of critique-quality preferences, which makes RL more effective at improving the critique ability of LLMs. Using the MultiCritique pipeline, we construct the MultiCritiqueDataset for the SFT and RL fine-tuning stages. Extensive experimental results on two benchmarks demonstrate: 1) the superior quality of our constructed SFT dataset compared to existing critique datasets; 2) additional improvements to the critique ability of LLMs brought by the RL stage. Notably, our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models, approaching the performance of advanced 70B LLMs and GPT-4. Code, datasets, and model weights will be made publicly available.
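
The two ideas in the abstract, aggregating critiques from several agents and deriving preference labels from multi-agent feedback, can be illustrated with a toy sketch. This is not the MultiCritique implementation: the abstract does not specify the aggregation rule or the scoring method, so the majority-vote rule, the F1 score, the flaw labels, and the function names `aggregate_flaws` and `preference_pair` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical data: each agent's critique reduced to a list of flaw labels.
agent_flaws = {
    "agent_a": ["wrong_date", "missing_step"],
    "agent_b": ["wrong_date"],
    "agent_c": ["missing_step", "bad_tone"],
}


def aggregate_flaws(agent_flaws, min_votes=2):
    """Keep only flaws reported by at least `min_votes` agents (assumed rule)."""
    votes = Counter(
        flaw for flaws in agent_flaws.values() for flaw in set(flaws)
    )
    return {flaw for flaw, count in votes.items() if count >= min_votes}


def preference_pair(agent_flaws, consensus):
    """Rank critiques by F1 against the consensus; return (chosen, rejected) agents."""

    def f1(flaws):
        flaws = set(flaws)
        hits = len(flaws & consensus)
        if hits == 0:
            return 0.0
        precision = hits / len(flaws)
        recall = hits / len(consensus)
        return 2 * precision * recall / (precision + recall)

    ranked = sorted(agent_flaws, key=lambda a: f1(agent_flaws[a]), reverse=True)
    return ranked[0], ranked[-1]


consensus = aggregate_flaws(agent_flaws)
chosen, rejected = preference_pair(agent_flaws, consensus)
```

In this sketch the consensus set could serve as an aggregated critique for SFT, and the (chosen, rejected) pair as a preference example for RL. A real pipeline would work on free-text critiques rather than pre-extracted labels.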

Improving Reward Models with Synthetic Critiques

Self-Generated Critiques Boost Reward Modeling for Language Models

Self-critiquing models for assisting human evaluators

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Fine-Tuning Language Models from Human Preferences

Training Language Models to Critique With Multi-agent Feedback

Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

RewardBench: Evaluating Reward Models for Language Modeling

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Interpreting Language Reward Models via Contrastive Explanations

Spontaneous Reward Hacking in Iterative Self-Refinement

Critique-out-Loud Reward Models

Self-Rewarding Language Models

RATE: Score Reward Models with Imperfect Rewrites of Rewrites

RRM: Robust Reward Model Training Mitigates Reward Hacking

Rethinking the Role of Proxy Rewards in Language Model Alignment

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards