Self-rationalization improves LLM as a fine-grained judge

Prapti Trivedi, Aditya Gulati, Oliver Molenschot, Meghana Arakkal Rajeev, Rajkumar Ramamurthy, Keith Stevens, Tanveesh Singh Chaudhery, Jahnavi Jambholkar, James Zou, Nazneen Rajani
2024-10-08
Abstract: LLM-as-a-judge models have been used for evaluating both human and AI generated content, specifically by providing scores and rationales. Rationales, in addition to increasing transparency, help models learn to calibrate their judgments. Enhancing a model's rationale can therefore improve its calibration abilities and ultimately its ability to score content. We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models, which consequently improves the score for fine-grained customizable scoring criteria (i.e., Likert-scale scoring with arbitrary evaluation criteria). Self-rationalization works by having the model generate multiple judgments with rationales for the same input, curating a preference pair dataset from its own judgments, and iteratively fine-tuning the judge via DPO. Intuitively, this approach allows the judge model to self-improve by learning from its own rationales, leading to better alignment and evaluation accuracy. After just two iterations -- while only relying on examples in the training set -- human evaluation shows that our judge model learns to produce higher quality rationales, with a win rate of $62\%$ on average compared to models just trained via SFT on rationale. This judge model also achieves high scoring accuracy on BigGen Bench and Reward Bench, outperforming even larger models trained using SFT with rationale, self-consistency or best-of-$N$ sampling by $3\%$ to $9\%$.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to make large language models (LLMs) better judges, both in the scores they assign and in the rationales they give for those scores. It introduces a method named "Self-Rationalization": the judge generates multiple judgments with rationales for the same input, a preference-pair dataset is built from those judgments, and the judge is fine-tuned on it with direct preference optimization (DPO). Repeating this process improves both scoring under fine-grained, customizable criteria and the quality of the rationales.

### Main Problems

1. **Improving Scoring Accuracy**: Existing LLM-as-a-judge models are not accurate enough, especially on tasks that require fine-grained scoring criteria.
2. **Enhancing Rationale Quality**: Rationales generated by existing models are often not detailed or accurate enough, which undermines the transparency and credibility of the judge.
3. **Reducing Dependence on Human-Annotated Data**: Traditional training methods rely on large amounts of human-annotated data, which is costly and difficult to scale.

### Solutions

The paper proposes a new training method, "Self-Rationalization", with the following steps:

1. **Seed Initialization**: Start from a supervised fine-tuned seed judge ($J_{\text{SFT}}$) that has been trained on the initial annotated dataset.
2. **Self-Rationalization**: For each input, sample multiple judgments from the current judge, each consisting of a rationale and a score.
3. **Preference Data Curation**: From the sampled judgments, select higher-quality and lower-quality ones to form a preference-pair dataset.
4. **Preference Optimization**: Fine-tune the judge on the preference pairs with DPO so that it produces better rationales and more accurate scores, then repeat from step 2 with the updated judge.

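
The preference-data and optimization steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Judgment` type, the `curate_pair` helper, and the rule "chosen = smallest score error against a reference score, rejected = largest" are hypothetical stand-ins for whatever selection criterion the paper uses. The `dpo_loss` function is the standard per-pair DPO objective, with `beta=0.1` as an arbitrary illustrative value.

```python
import math
from dataclasses import dataclass


@dataclass
class Judgment:
    rationale: str
    score: int  # Likert-style score, e.g. 1..5


def curate_pair(judgments, reference_score):
    """Pick a (chosen, rejected) pair from judgments sampled for one input.

    Assumed heuristic (not taken from the paper): the chosen judgment has
    the smallest score error against the reference score, the rejected one
    the largest. Returns None when every sample has the same error, since
    there is then no preference signal.
    """
    ranked = sorted(judgments, key=lambda j: abs(j.score - reference_score))
    chosen, rejected = ranked[0], ranked[-1]
    if abs(chosen.score - reference_score) == abs(rejected.score - reference_score):
        return None
    return chosen, rejected


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy being trained and under the frozen reference model."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical samples for one input whose reference score is 4
samples = [
    Judgment("Covers every rubric point.", 4),
    Judgment("Misses the rubric's main criterion.", 2),
    Judgment("Mostly right but overly generous.", 5),
]
pair = curate_pair(samples, reference_score=4)
```

In an actual training run, the curated pairs would be passed to a DPO implementation (for example, the `DPOTrainer` in Hugging Face TRL) rather than to a hand-written loss. The function here only shows what is being optimized: the loss falls as the policy raises the likelihood of the chosen judgment relative to the rejected one, measured against the reference model.
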
### Experimental Results

- **Performance Improvement**: After two iterations of self-rationalization, the judge achieves high scoring accuracy on BigGen Bench and Reward Bench, outperforming even larger models trained using SFT with rationale, self-consistency, or best-of-$N$ sampling by 3% to 9%.
- **Rationale Quality**: In human evaluation, the self-rationalized judge produces higher-quality rationales, with an average win rate of 62% against models trained only via SFT on rationales.
- **Resource Efficiency**: Compared with traditional supervised fine-tuning (SFT), self-rationalization requires fewer training samples and computing resources and converges faster.

### Conclusion

Self-rationalization improves the performance of LLM-as-a-judge models on fine-grained scoring tasks, particularly in the quality of the rationales they generate. The method improves scoring accuracy while reducing dependence on human-annotated data, since the judge learns from its own judgments on examples already in the training set, which makes it practical to apply.