Abstract: Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language -- functions that score system output given the context and/or human reference responses -- of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate metrics across a representative set of tasks. Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics), a resource to make research into new metrics itself easier to evaluate. BEAMetrics users can quickly compare existing and new metrics with human judgements across a diverse set of tasks, quality dimensions (fluency vs. coherence vs. informativeness etc.), and languages. As generation experts might predict, BEAMetrics reveals stark task-dependent differences between existing metrics, and consistently poor performance on tasks with complex answer spaces or high reliance on general knowledge. While this analysis highlights a critical issue facing current research practice, BEAMetrics also contributes to its resolution by facilitating research into better metrics -- particularly those that can account for the complex interaction between context and general knowledge inherent to many modern NLP applications. BEAMetrics is available under the MIT License: https://github.com/ThomasScialom/BEAMetrics

OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation

What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

A Survey of Evaluation Metrics Used for NLG Systems

Automatic Metrics in Natural Language Generation: A Survey of Current Evaluation Practices

INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback

Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models

MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

OMGEval: an Open Multilingual Generative Evaluation Benchmark for Large Language Models

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

CogME: A Cognition-Inspired Multi-Dimensional Evaluation Metric for Story Understanding