Abstract: Large Language Models (LLMs) have demonstrated impressive performance on Natural Language Processing (NLP) tasks such as Question Answering, Summarization, and Classification. Using LLMs as evaluators that rank or score the output of other models (usually LLMs) has become increasingly popular because of the limitations of current evaluation techniques, including the lack of appropriate benchmarks and metrics, the cost, and limited access to human annotators. While LLMs are capable of handling approximately 100 languages, the majority of languages beyond the top 20 lack systematic evaluation across various tasks, metrics, and benchmarks. This creates an urgent need to scale up multilingual evaluation to ensure a precise understanding of LLM performance across diverse languages. LLM-based evaluators seem like the perfect solution to this problem, as they do not require human annotators, human-created references, or benchmarks, and can theoretically be used to evaluate any language covered by the LLM. In this paper, we investigate whether LLM-based evaluators can help scale up multilingual evaluation. Specifically, we calibrate LLM-based evaluation against 20k human judgments of five metrics across three text-generation tasks in eight languages. Our findings indicate that LLM-based evaluators may exhibit bias towards higher scores; they should be used with caution and always calibrated with a dataset of native speaker judgments, particularly in low-resource and non-Latin script languages.
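As a rough illustration of what calibrating an LLM-based evaluator against human judgments can involve, the sketch below compares two parallel lists of scores. It is not the paper's procedure: the function names, the 0-2 score scale, and the choice of percent agreement, Cohen's kappa, and mean score gap are all assumptions made for this example.

```python
from collections import Counter


def percent_agreement(human, llm):
    """Fraction of items where the LLM score equals the human score."""
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, llm)) / len(human)


def mean_score_gap(human, llm):
    """Mean of (LLM score - human score); positive means the LLM scores higher."""
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return (sum(llm) - sum(human)) / len(human)


def cohen_kappa(human, llm):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(human)
    observed = percent_agreement(human, llm)
    human_counts = Counter(human)
    llm_counts = Counter(llm)
    labels = set(human_counts) | set(llm_counts)
    # Agreement expected by chance if both raters scored independently.
    expected = sum(human_counts[k] * llm_counts[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores on an assumed 0-2 scale for eight outputs.
human_scores = [0, 1, 2, 2, 1, 0, 1, 2]
llm_scores = [1, 1, 2, 2, 2, 1, 1, 2]

print("agreement:", percent_agreement(human_scores, llm_scores))
print("kappa:", cohen_kappa(human_scores, llm_scores))
print("mean gap:", mean_score_gap(human_scores, llm_scores))
```

A positive mean gap alongside modest agreement is the kind of pattern that would suggest a bias towards higher scores. In practice such a comparison would be run per language and per metric.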

Evaluation of Reliability Criteria for News Publishers with Large Language Models

Large Language Models' Detection of Political Orientation in Newspapers

Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Cognitive Biases in Large Language Models for News Recommendation

Large Language Models are Inconsistent and Biased Evaluators

NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Evaluating the Efficacy of Large Language Models in Detecting Fake News: A Comparative Analysis

Can Large Language Models Be an Alternative to Human Evaluations?

Style Over Substance: Evaluation Biases for Large Language Models

Benchmarking Large Language Models for News Summarization

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Evaluating Trustworthiness of Online News Publishers via Article Classification

Testing the reliability of an AI-based large language model to extract ecological information from the scientific literature

Evaluating the Consistency of LLM Evaluators

Are Large Language Models Reliable Argument Quality Annotators?

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

Ranking Large Language Models without Ground Truth

Are Large Language Models Good Fact Checkers: A Preliminary Study