Abstract: LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness in diverse contexts, such as non-English prompts, factual verification, or challenging questions, remains unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting several key findings on their behavior. First, we discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, often more than the language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we find that state-of-the-art evaluators struggle with challenging prompts, in either English or Korean, underscoring their limitations in assessing or generating complex reasoning questions. We release the dataset and code used.

What problem does this paper attempt to address?

The paper addresses three main problems:

1. **Effectiveness of automated evaluators in new languages**: The researchers examine how large language models (LLMs) perform as evaluators or reward models in different language environments, and in particular whether their evaluation skills transfer to languages they have not been trained on. This covers non-English prompts, factual verification, and challenging questions.
2. **Limitations of automated evaluators**: The paper explores the shortcomings of current state-of-the-art evaluators when faced with challenging prompts, especially their ability to detect and penalize errors such as factual inaccuracies, cultural misrepresentations, and inappropriate language use.
3. **Transfer of cross-lingual evaluation capabilities**: The researchers create a bilingual meta-evaluation dataset (KUDGE) to test and analyze how evaluation capabilities transfer between languages, especially from English to Korean.

The paper explores these problems in the following ways:

- **Creating the KUDGE dataset**: The dataset contains an original subset and a challenge subset, used to evaluate model performance on regular tasks and on complex reasoning tasks respectively.
- **Experimental setup**: 20 LLMs, including proprietary and open-source models, are evaluated and compared on pointwise and pairwise evaluation tasks.
- **Regression analysis**: The researchers find that a model's performance on English evaluation benchmarks (such as RewardBench) predicts its performance on KUDGE better than its performance on Korean-specific benchmarks (such as KMMLU), indicating that evaluation capabilities are to some extent language-independent.
- **Performance of fine-tuned models**: The researchers also test fine-tuned evaluators (such as Prometheus2) on Korean evaluation tasks and find that, even though these models are trained mainly on English data, they can be applied to Korean evaluation to a certain extent, although they show some English bias.

Through these studies, the paper reveals the potential and limitations of automated evaluators in a multilingual environment, providing insights for further improving these models.
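The regression comparison described above can be illustrated with a minimal sketch. It is not the paper's analysis code: the helper `r_squared`, the variable names, and all numbers are assumptions made for illustration, and the scores are synthetic values drawn from a random generator, not results for any real model. The sketch fits two one-predictor least-squares models and compares how much variance in a target score each predictor explains.

```python
import numpy as np


def r_squared(x, y):
    """Return R^2 of the ordinary least-squares fit y ~ a + b * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with an intercept column and the single predictor.
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# SYNTHETIC scores for 20 hypothetical models (assumption, not paper data).
rng = np.random.default_rng(0)
english_eval_score = rng.uniform(50, 90, size=20)      # stand-in for an English evaluation benchmark
korean_knowledge_score = rng.uniform(30, 70, size=20)  # stand-in for a Korean knowledge benchmark

# The target is constructed to depend mostly on the English evaluation score,
# purely so the comparison has something to show.
korean_eval_score = (
    0.8 * english_eval_score
    + 0.1 * korean_knowledge_score
    + rng.normal(0.0, 2.0, size=20)
)

r2_english = r_squared(english_eval_score, korean_eval_score)
r2_korean = r_squared(korean_knowledge_score, korean_eval_score)
print(f"R^2 using English evaluation score: {r2_english:.3f}")
print(f"R^2 using Korean knowledge score:   {r2_korean:.3f}")
```

With real data, the three arrays would be replaced by each model's measured benchmark scores; a higher R^2 for one predictor would indicate that it explains more of the variation in the target score, which is the kind of comparison the summary describes.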

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Can Large Language Models Be an Alternative to Human Evaluations?

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

LLM-AS-AN-INTERVIEWER: Beyond Static Testing Through Dynamic LLM Evaluation

Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

A Survey on LLM-as-a-Judge

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Assessing the Proficiency of LLMs with Various Tasks and Evaluators

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Evaluating the Consistency of LLM Evaluators

Can LLM be a Personalized Judge?