Abstract: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge *and* follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.

What problem does this paper attempt to address?

The paper aims to address two key issues in the self-improvement process of large language models (LLMs):

1. **How to enhance the model's self-assessment capability**: Existing self-rewarding mechanisms mainly focus on improving the quality of the model's generated responses and neglect the model's ability as a judge. As a result, performance gains saturate quickly during iterative training. To solve this problem, the paper proposes a method called "Meta-Rewarding," which improves the model's judging skills by having it evaluate its own judgments.
2. **How to overcome the length bias problem**: When acting as a judge, the model tends to favor longer responses, so response length keeps increasing across training iterations. To address this issue, the paper introduces a scoring mechanism that incorporates length information, so that the shorter response is chosen when scores are close.

Specifically, the "Meta-Rewarding" method includes the following key steps:

- The model plays three roles: actor (generating responses), judge (evaluating responses), and meta-judge (evaluating the judge's evaluations).
- Training data is generated through self-play: as an actor, the model generates multiple response variants; as a judge, it scores these responses; and as a meta-judge, it compares the quality of the judgments.
- The resulting preference data is used to train the model to improve as both an actor and a judge.
- A simple length control mechanism prevents response length from growing excessively across iterations.

Experimental results show that models trained with "Meta-Rewarding" improve substantially on the AlpacaEval 2 and Arena-Hard benchmarks, especially on complex and difficult questions. The method also controls the growth of response length, avoiding the negative impact of length bias. Overall, "Meta-Rewarding" enhances the model's instruction-following ability and judgment accuracy without additional human-supervised data.
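The length control idea described above (prefer the shorter response when scores are close) can be illustrated with a minimal sketch. This is not the paper's implementation: the `ScoredResponse` type, the `pick_preference_pair` function, and the tolerance parameter `rho` are all hypothetical names introduced here, and "close" is assumed to mean "within a fraction `rho` of the score range below the best score."

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ScoredResponse:
    """A candidate response with the score assigned by the judge (hypothetical type)."""
    text: str
    score: float


def pick_preference_pair(
    responses: List[ScoredResponse], rho: float = 0.1
) -> Optional[Tuple[ScoredResponse, ScoredResponse]]:
    """Return (chosen, rejected), or None if all scores are equal.

    Assumption: responses whose score lies within rho * (max - min) of the
    best score count as "close"; among those, the shortest one is chosen.
    The rejected response is simply the lowest-scoring one.
    """
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")

    s_max = max(r.score for r in responses)
    s_min = min(r.score for r in responses)
    if s_max == s_min:
        # No preference signal: every response received the same score.
        return None

    # Top tier: everything close enough to the best score.
    threshold = s_max - rho * (s_max - s_min)
    top_tier = [r for r in responses if r.score >= threshold]

    # Length control: among near-equal scores, prefer the shortest response.
    chosen = min(top_tier, key=lambda r: len(r.text))
    rejected = min(responses, key=lambda r: r.score)
    return chosen, rejected


if __name__ == "__main__":
    candidates = [
        ScoredResponse("a" * 100, 4.8),  # best score, but long
        ScoredResponse("b" * 20, 4.7),   # nearly as good, much shorter
        ScoredResponse("c" * 50, 2.0),   # clearly worse
    ]
    pair = pick_preference_pair(candidates, rho=0.1)
    if pair is not None:
        chosen, rejected = pair
        print(len(chosen.text), chosen.score, rejected.score)
```

Setting `rho=0` in this sketch disables length control and always picks the top-scored response; larger values trade a small amount of judged quality for shorter chosen responses.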

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Self-Rewarding Language Models

Reasons to Reject? Aligning Language Models with Judgments

Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language

Language Imbalance Driven Rewarding for Multilingual Self-improving

Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

Aligning Large Language Models via Fine-grained Supervision

Self-Taught Evaluators

Language Model Self-improvement by Reinforcement Learning Contemplation

Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Learning Reward for Robot Skills Using Large Language Models via Self-Alignment

Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Teaching Language Models to Self-Improve by Learning from Language Feedback

Self-Generated Critiques Boost Reward Modeling for Language Models

Self-Boosting Large Language Models with Synthetic Preference Data

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

ALaRM: Align Language Models via Hierarchical Rewards Modeling

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

Progressively Label Enhancement for Large Language Model Alignment

LongReward: Improving Long-context Large Language Models with AI Feedback