Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
Abstract: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgments and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge *and* follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
What problem does this paper attempt to address?
The paper primarily aims to address two key issues in the self-improvement process of large language models (LLMs):
1. **How to enhance the model's self-assessment capability**: Existing self-reward mechanisms mainly focus on improving the quality of the model's generated responses, while neglecting the enhancement of the model's ability as a judge. This leads to rapid saturation of model performance improvement during iterative training. To solve this problem, the paper proposes a method called "Meta-Rewarding," which improves the model's judging skills by having it evaluate its own judgment results.
2. **How to overcome the length bias problem**: When acting as a judge, the model tends to favor longer responses, so response length keeps growing from one training iteration to the next. To address this issue, the paper introduces a selection mechanism that takes length into account: when the scores of several responses are close, the shorter response is chosen.
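The length-aware selection in item 2 can be illustrated with a minimal sketch. It is not the paper's exact rule: the `Scored` type, the `pick_preference_pair` function, and the `margin` threshold that defines "close" scores are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Scored:
    text: str     # response text
    score: float  # judge score for this response (higher is better)


def pick_preference_pair(
    responses: List[Scored], margin: float = 0.5
) -> Optional[Tuple[Scored, Scored]]:
    """Return (chosen, rejected), or None if there is no usable preference.

    Assumption: scores within `margin` of the best score count as "close",
    and the shortest response among those is taken as the chosen one.
    """
    if len(responses) < 2:
        return None
    best = max(r.score for r in responses)
    # Responses whose score is close to the best one.
    near_best = [r for r in responses if best - r.score <= margin]
    # Break near-ties in favor of the shortest response.
    chosen = min(near_best, key=lambda r: len(r.text))
    # The rejected response is simply the lowest-scoring one.
    rejected = min(responses, key=lambda r: r.score)
    if chosen.score <= rejected.score:
        return None  # no real quality gap, so no training signal
    return chosen, rejected
```

With `margin=0.0` the sketch reduces to plain best-versus-worst selection, which is the setting where a judge's preference for long answers would go unchecked.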
Specifically, the "Meta-Rewarding" method proposed in the paper includes the following key steps:
- The model plays three roles: actor (generating responses), judge (evaluating responses), and meta-judge (evaluating the judge's judgments).
- Training data is generated through self-play. As an actor, the model generates multiple candidate responses to each prompt; as a judge, it scores those responses; and as a meta-judge, it compares the quality of the resulting judgments.
- Preference data is used to train the model to improve its performance as both an actor and a judge.
- A simple length control mechanism is introduced to prevent the response length from growing excessively with iterations.
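The self-play steps above can be sketched as a single data-building function. This is a simplified illustration under stated assumptions, not the paper's implementation: `build_preference_data` and the three callables `generate`, `judge`, and `meta_judge` are hypothetical stand-ins for calls to the same model under different prompts, only the first two judgments per response are compared, and length control is left out to keep the sketch short.

```python
from typing import Callable, List, Tuple

Judgment = Tuple[str, float]  # (judgment text, score)


def build_preference_data(
    prompt: str,
    generate: Callable[[str], str],                    # actor role
    judge: Callable[[str, str], Judgment],             # judge role
    meta_judge: Callable[[str, str, str, str], int],   # returns 0 or 1
    n_responses: int = 4,
    n_judgments: int = 2,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str, str]]]:
    """Return (actor_pairs, judge_pairs) for one prompt.

    actor_pairs: (chosen_response, rejected_response)
    judge_pairs: (response, chosen_judgment, rejected_judgment)
    """
    if n_judgments < 2:
        raise ValueError("need at least two judgments to compare")
    scored: List[Tuple[str, float]] = []
    judge_pairs: List[Tuple[str, str, str]] = []
    for _ in range(n_responses):
        response = generate(prompt)
        judgments = [judge(prompt, response) for _ in range(n_judgments)]
        # Actor signal: average the judge scores for this response.
        mean = sum(score for _, score in judgments) / len(judgments)
        scored.append((response, mean))
        # Judge signal: the meta-judge picks the better of two judgments.
        (text_a, _), (text_b, _) = judgments[0], judgments[1]
        if text_a != text_b:
            winner = meta_judge(prompt, response, text_a, text_b)
            chosen, rejected = (text_a, text_b) if winner == 0 else (text_b, text_a)
            judge_pairs.append((response, chosen, rejected))
    best = max(scored, key=lambda item: item[1])
    worst = min(scored, key=lambda item: item[1])
    # Keep an actor pair only when the scores actually differ.
    actor_pairs = [(best[0], worst[0])] if best[1] > worst[1] else []
    return actor_pairs, judge_pairs
```

Both lists would then feed a preference-training step, so that one round of training updates the model as an actor and as a judge.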
Experimental results show that models trained with the "Meta-Rewarding" method improve substantially on the AlpacaEval 2 and Arena-Hard benchmarks: the win rate of Llama-3-8B-Instruct rises from 22.9% to 39.4% on AlpacaEval 2 and from 20.6% to 29.1% on Arena-Hard, with gains especially on complex and difficult questions. The method also keeps response length from growing unchecked, which limits the effect of length bias. Overall, "Meta-Rewarding" improves both the model's instruction-following ability and its judgment accuracy without any additional human-supervised data.