Abstract: Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language -- functions that score system output given the context and/or human reference responses -- of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate metrics across a representative set of tasks. Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics), a resource to make research into new metrics itself easier to evaluate. BEAMetrics users can quickly compare existing and new metrics with human judgements across a diverse set of tasks, quality dimensions (fluency vs. coherence vs. informativeness, etc.), and languages. As generation experts might predict, BEAMetrics reveals stark task-dependent differences between existing metrics, and consistently poor performance on tasks with complex answer spaces or high reliance on general knowledge. While this analysis highlights a critical issue facing current research practice, BEAMetrics also contributes to its resolution by facilitating research into better metrics -- particularly those that can account for the complex interaction between context and general knowledge inherent to many modern NLP applications. BEAMetrics is available under the MIT License: https://github.com/ThomasScialom/BEAMetrics

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation
