Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto
Abstract: LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?". To achieve this, we first fit a generalized linear model to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, we also find that it increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code and leaderboard at
What problem does this paper attempt to address?
This paper addresses length bias in the automatic evaluation of language models. LLM-based auto-annotators such as the one used by AlpacaEval are cost-effective and scalable, but they can introduce complex biases, notably a preference for longer outputs. This preference can distort evaluation results and can be exploited by model developers to inflate scores artificially. The paper therefore proposes a regression-based method for controlling such biases and applies it to the length bias in AlpacaEval.
### Main contributions:
1. **Proposing a simple regression-based de-biasing method**: The method is intended to preserve desirable properties of an automatic evaluation metric, such as low cost, accuracy, and robustness.
2. **Application to AlpacaEval**: Adding length control yields a new evaluation metric, AlpacaEval-LC, which is more robust to length-related spurious correlations.
3. **Improving the correlation with human evaluation**: The Spearman correlation with LMSYS's Chatbot Arena rises from 0.94 for AlpacaEval to 0.98 for AlpacaEval-LC.
4. **Reducing the manipulability of evaluation**: Because length is controlled for, the new metric is less sensitive to the length of the model output, which makes scores harder to manipulate through verbosity.
### Method overview:
- **Regression model**: Fit a generalized linear model (GLM) that predicts the auto-annotator's preference from three factors: model identity, the length difference between the model's and the baseline's outputs, and instruction difficulty.
- **Length control**: Obtain the length-controlled preference estimate by predicting with the fitted GLM while setting the length difference to zero.
- **Training and validation**: Use cross-validation and L2 regularization to prevent overfitting and to keep the metric robust and interpretable.
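
The three steps above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a simplified logistic GLM with only a quality term (`theta`) and a length term (`phi`), omits the instruction-difficulty factor, and runs on synthetic data. The function names, the `tanh` squashing, the `scale` constant, and all numbers are assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def length_feature(len_model, len_baseline, scale):
    """Squash the length difference into (-1, 1); `scale` is a hypothetical constant."""
    return math.tanh((len_model - len_baseline) / scale)


def fit_glm(feats, prefs, l2=1e-3, lr=0.5, steps=1500):
    """Fit pref ~ sigmoid(theta + phi * feat) by gradient descent on cross-entropy.

    theta: quality term for model vs. baseline; phi: length-bias coefficient.
    Only phi is L2-regularized (a choice made for this sketch).
    """
    theta = phi = 0.0
    n = len(prefs)
    for _ in range(steps):
        g_theta = g_phi = 0.0
        for x, y in zip(feats, prefs):
            err = sigmoid(theta + phi * x) - y
            g_theta += err
            g_phi += err * x
        theta -= lr * g_theta / n
        phi -= lr * (g_phi / n + l2 * phi)
    return theta, phi


# Synthetic data: the model is exactly as good as the baseline (true theta = 0)
# but writes longer outputs, and the simulated annotator rewards length (true phi = 1.5).
rng = random.Random(0)
feats = [length_feature(1000 + rng.gauss(200, 300), 1000, scale=300.0) for _ in range(300)]
prefs = [sigmoid(0.0 + 1.5 * x) for x in feats]  # soft preferences in [0, 1]

theta, phi = fit_glm(feats, prefs)

raw_win_rate = sum(prefs) / len(prefs)  # inflated by the length bias
lc_win_rate = sigmoid(theta)            # prediction with the length feature set to 0

print(f"raw win rate: {raw_win_rate:.3f}")
print(f"length-controlled win rate: {lc_win_rate:.3f}")
```

In this toy setup the raw win rate should come out well above 0.5 because the simulated model is more verbose, while the length-controlled estimate should sit close to 0.5, which matches the equal quality built into the synthetic data.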
### Experimental results:
- **Reducing length manipulability**: AlpacaEval-LC is much less sensitive to prompts that make a model more or less verbose: the normalized standard deviation of the win rate across such prompts drops from 25% to 10%.
- **Improving the correlation with Chatbot Arena**: The Spearman correlation coefficient increases from 0.94 to 0.98.
- **Robustness and interpretability**: Thanks to regularization, AlpacaEval-LC is more resistant to truncation attacks and remains interpretable as a win rate.
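
As a rough illustration of how the two headline statistics above can be computed, the sketch below implements a normalized standard deviation and a Spearman rank correlation in plain Python. Two assumptions apply: "normalized" is taken to mean standard deviation divided by the mean, which may differ from the paper's exact definition, and every number is made up for the example rather than taken from the paper.

```python
import statistics


def normalized_std(values):
    """Standard deviation divided by the mean (assumed normalization)."""
    return statistics.pstdev(values) / statistics.mean(values)


def ranks(values):
    """Rank values from 1..n; assumes no ties, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical win rates (%) of one model under concise / standard / verbose prompts.
raw_win_rates = [20.0, 30.0, 40.0]
lc_win_rates = [28.0, 30.0, 32.0]
print(normalized_std(raw_win_rates), normalized_std(lc_win_rates))

# Hypothetical benchmark scores and Arena-style ratings for five models.
benchmark_scores = [10.0, 25.0, 30.0, 50.0, 45.0]
arena_ratings = [1000, 1050, 1100, 1150, 1200]
print(spearman(benchmark_scores, arena_ratings))
```

A metric that is harder to game through verbosity shows a smaller normalized standard deviation across the prompt variants, and a metric that ranks models as Chatbot Arena does yields a Spearman correlation closer to 1.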
### Discussion:
- **Other biases**: Although the paper focuses on length bias, the same regression approach can be applied to other biases, such as an auto-annotator's preference for its own model's outputs or for outputs that contain lists.
- **Application in RLHF**: The method could be used to de-bias reward models in reinforcement learning from human feedback; the paper leaves this direction to future research.
In conclusion, the paper mitigates length bias in automatic evaluation by adding regression-based length control, which makes the evaluation fairer and more accurate.