LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong
2024-10-02
Abstract: LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness in diverse contexts, such as non-English prompts, factual verification, or challenging questions, remains unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting several key findings on their behavior. First, we discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, often more than the language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we find that state-of-the-art evaluators struggle with challenging prompts, in either English or Korean, underscoring their limitations in assessing or generating complex reasoning questions. We release the dataset and codes used.
Computation and Language
What problem does this paper attempt to address?
The paper focuses on three questions:

1. **Effectiveness of automated evaluators in new languages**: How well do large language models (LLMs) perform as judges or reward models across languages, and in particular can their evaluation ability transfer to languages they have not been trained on? This covers non-English prompts, fact verification, and the handling of complex problems.
2. **Limitations of automated evaluators**: Where do current state-of-the-art evaluators fall short on challenging prompts, especially in detecting and penalizing errors such as factual inaccuracies, cultural misrepresentations, and unwanted language?
3. **Transfer of cross-language evaluation capabilities**: The researchers build a bilingual meta-evaluation dataset (KUDGE) to test and analyze how evaluation ability transfers between languages, with a focus on transfer from English to Korean.

The paper addresses these questions in the following ways:

- **Creating the KUDGE dataset**: The dataset contains an original subset and a challenge subset, used to evaluate models on regular tasks and on complex reasoning tasks respectively.
- **Experimental setup**: 20 different LLMs, including proprietary and open-source models, are evaluated and compared on pointwise and pairwise evaluation tasks.
- **Regression analysis**: A regression analysis shows that a model's performance on English evaluation benchmarks (such as RewardBench) predicts its performance on KUDGE better than its performance on Korean-specific benchmarks (such as KMMLU), indicating that evaluation capability is to some extent language-independent.
- **Performance of fine-tuned models**: The researchers also test fine-tuned evaluators (such as Prometheus2) on Korean evaluation tasks. Although these models are trained mainly on English data, they can be applied to Korean evaluation to a certain extent, but they show an English bias.

Together, these studies reveal both the potential and the limitations of automated evaluators in a multilingual setting and offer insights for improving them.
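The regression comparison described above can be illustrated with a minimal sketch. It assumes a simple one-predictor least-squares fit per benchmark and compares the resulting R² values; this is not the authors' exact regression specification. The function name `r_squared` and all score lists are hypothetical, and the numbers are synthetic placeholders chosen only to show the mechanics, not results from the paper.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 of a one-predictor least-squares fit y ~ a + b * x.

    For a single predictor this equals the squared Pearson correlation.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return (sxy * sxy) / (sxx * syy)


# Synthetic placeholder scores for six hypothetical models (NOT paper data).
english_eval_score = [60, 65, 70, 75, 80, 85]      # e.g. an English judge benchmark
korean_knowledge_score = [50, 40, 55, 45, 60, 42]  # e.g. a Korean knowledge benchmark
korean_judge_accuracy = [55, 58, 63, 66, 71, 74]   # e.g. accuracy on a Korean meta-evaluation set

r2_english = r_squared(english_eval_score, korean_judge_accuracy)
r2_korean = r_squared(korean_knowledge_score, korean_judge_accuracy)

# The predictor with the higher R^2 explains more of the variance in judge accuracy.
better = "English evaluation score" if r2_english > r2_korean else "Korean knowledge score"
print(f"R^2 (English eval): {r2_english:.3f}")
print(f"R^2 (Korean knowledge): {r2_korean:.3f}")
print(f"Stronger single predictor in this toy data: {better}")
```

With real per-model scores substituted for the placeholders, a higher R² for the English evaluation benchmark than for the Korean-specific benchmark would be the pattern the summary describes. A multi-predictor regression would be needed to separate the two effects properly.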