Shared Task on Evaluating Accuracy in Natural Language Generation

Ehud Reiter,Craig Thomson

DOI: https://doi.org/10.48550/arXiv.2006.12234

2020-06-22

Computation and Language

Abstract:We propose a shared task on methodologies and algorithms for evaluating the accuracy of generated texts. Participants will measure the accuracy of basketball game summaries produced by NLG systems from basketball box score data.

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

Craig Thomson,Ehud Reiter

DOI: https://doi.org/10.48550/arXiv.2011.03992

2020-11-08

Abstract:Most Natural Language Generation systems need to produce accurate texts. We propose a methodology for high-quality human evaluation of the accuracy of generated texts, which is intended to serve as a gold-standard for accuracy evaluations of data-to-text systems. We use our methodology to evaluate the accuracy of computer generated basketball summaries. We then show how our gold standard evaluation can be used to validate automated metrics.

Computation and Language
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Qingyu Lu,Liang Ding,Liping Xie,Kanjian Zhang,Derek F. Wong,Dacheng Tao

DOI: https://doi.org/10.18653/v1/2023.acl-long.324

2023-01-01

Abstract:The pretrained language model (PLM) based metrics have been successfully used in evaluating language generation tasks. Recent studies of the human evaluation community show that considering both major errors (e.g. mistranslated tokens) and minor errors (e.g. imperfections in fluency) can produce high-quality judgments. This inspires us to approach the final goal of the automatic metrics (human-like evaluations) by fine-grained error analysis. In this paper, we argue that the ability to estimate sentence confidence is the tip of the iceberg for PLM-based metrics. And it can be used to refine the generated sentence toward higher confidence and more reference-grounded, where the costs of refining and approaching reference are used to determine the major and minor errors, respectively. To this end, we take BARTScore as the testbed and present an innovative solution to marry the unexploited sentence refining capacity of BARTScore and human-like error analysis, where the final score consists of both the evaluations of major and minor errors. Experiments show that our solution consistently improves BARTScore, outperforming top-scoring metrics in 19/25 test settings. Analyses demonstrate our method robustly and efficiently approaches human-like evaluations, enjoying better interpretability. Our code and scripts will be publicly released in https://github.com/Coldmist-Lu/ErrorAnalysis_NLGEvaluation.
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Simon Mille,Kaustubh D. Dhole,Saad Mahamood,Laura Perez-Beltrachini,Varun Gangal,Mihir Kale,Emiel van Miltenburg,Sebastian Gehrmann

DOI: https://doi.org/10.48550/arXiv.2106.09069

2021-06-16

Computation and Language

Abstract:Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text

Vivek Srivastava,Mayank Singh

DOI: https://doi.org/10.48550/arXiv.2108.01861

2021-08-04

Abstract:In this shared task, we seek the participating teams to investigate the factors influencing the quality of the code-mixed text generation systems. We synthetically generate code-mixed Hinglish sentences using two distinct approaches and employ human annotators to rate the generation quality. We propose two subtasks, quality rating prediction and annotators' disagreement prediction of the synthetic Hinglish dataset. The proposed subtasks will put forward the reasoning and explanation of the factors influencing the quality and human perception of the code-mixed text.

Computation and Language
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

Mingkai Deng,Bowen Tan,Zhengzhong Liu,Eric P. Xing,Zhiting Hu

DOI: https://doi.org/10.48550/arXiv.2109.06379

2022-01-22

Abstract:Natural language generation (NLG) spans a broad range of tasks, each of which serves for specific objectives and desires different properties of generated text. The complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on specific intuitions. In this paper, we propose a unifying perspective that facilitates the design of metrics for a wide range of language generation tasks and quality aspects. Based on the nature of information change from input to output, we classify NLG tasks into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). The information alignment, or overlap, between input, context, and output text plays a common central role in characterizing the generation. Using the uniform concept of information alignment, we develop a family of interpretable metrics for various NLG tasks and aspects, often without need of gold reference data. To operationalize the metrics, we train self-supervised models to approximate information alignment as a prediction task. Experiments show the uniformly designed metrics achieve stronger or comparable correlations with human judgement compared to state-of-the-art metrics in each of diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog. With information alignment as the intermediate representation, we deliver a composable library for easy NLG evaluation and future metric design.

Computation and Language,Machine Learning
GLGE: A New General Language Generation Evaluation Benchmark

Dayiheng Liu,Yu Yan,Yeyun Gong,Weizhen Qi,Hang Zhang,Jian Jiao,Weizhu Chen,Jie Fu,Linjun Shou,Ming Gong,Pengcheng Wang,Jiusheng Chen,Daxin Jiang,Jiancheng Lv,Ruofei Zhang,Winnie Wu,Ming Zhou,Nan Duan

DOI: https://doi.org/10.18653/v1/2021.findings-acl.36

2020-01-01

Abstract:Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress of pretraining and transfer learning in Natural Language Processing (NLP). These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks, without considering the Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we continue to design three subtasks in terms of task difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard). This introduces 24 subtasks to comprehensively compare model performance. To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet.
Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks

Simone Filice,Jason Ingyu Choi,Giuseppe Castellucci,Eugene Agichtein,Oleg Rokhlenko

2023-11-21

Abstract:Many Natural Language Generation (NLG) tasks aim to generate a single output text given an input prompt. Other settings require the generation of multiple texts, e.g., for Synthetic Traffic Generation (STG). This generation task is crucial for training and evaluating QA systems as well as conversational agents, where the goal is to generate multiple questions or utterances resembling the linguistic variability of real users. In this paper, we show that common NLG metrics, like BLEU, are not suitable for evaluating STG. We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts. We validate our metrics with an automatic procedure to verify whether they capture different types of quality issues of generated data; we also run human annotations to verify the correlation with human judgements. Experiments on three tasks, i.e., Shopping Utterance Generation, Product Question Generation and Query Auto Completion, demonstrate that our metrics are effective for evaluating STG tasks, and improve the agreement with human judgement up to 20% with respect to common NLG metrics. We believe these findings can pave the way towards better solutions for estimating the representativeness of synthetic text data.

Computation and Language
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

Albert Gatt,Emiel Krahmer

DOI: https://doi.org/10.1613/jair.5477

2018-01-27

Journal of Artificial Intelligence Research

Abstract:This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past two decades, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of NLP, with an emphasis on different evaluation methods and the relationships between them.

computer science, artificial intelligence
Analysis of Systems' Performance in Natural Language Processing Competitions

Sergio Nava-Muñoz,Mario Graff,Hugo Jair Escalante

DOI: https://doi.org/10.1016/j.patrec.2024.03.010

2024-08-21

Abstract:Collaborative competitions have gained popularity in the scientific and technological fields. These competitions involve defining tasks, selecting evaluation scores, and devising result verification methods. In the standard scenario, participants receive a training set and are expected to provide a solution for a held-out dataset kept by organizers. An essential challenge for organizers arises when comparing algorithms' performance, assessing multiple participants, and ranking them. Statistical tools are often used for this purpose; however, traditional statistical methods often fail to capture decisive differences between systems' performance. This manuscript describes an evaluation methodology for statistically analyzing competition results and competition. The methodology is designed to be universally applicable; however, it is illustrated using eight natural language competitions as case studies involving classification and regression problems. The proposed methodology offers several advantages, including off-the-shelf comparisons with correction mechanisms and the inclusion of confidence intervals. Furthermore, we introduce metrics that allow organizers to assess the difficulty of competitions. Our analysis shows the potential usefulness of our methodology for effectively evaluating competition results.

Machine Learning
Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers

Mika Hämäläinen,Khalid Alnajjar

DOI: https://doi.org/10.48550/arXiv.2108.00308

2021-08-01

Abstract:We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5 point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value, among many others. Our guidelines for future evaluation include clearly defining the goal of the generative system, asking questions as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results in a more profound way than merely reporting the most typical statistics.

Computation and Language
LLM-based NLG Evaluation: Current Status and Challenges

Mingqi Gao,Xinyu Hu,Jie Ruan,Xiao Pu,Xiaojun Wan

DOI: https://doi.org/10.48550/arXiv.2402.01383

2024-02-02

Computation and Language

Abstract:Evaluating natural language generation (NLG) is a vital but challenging problem in artificial intelligence. Traditional evaluation metrics mainly capturing content (e.g. n-gram) overlap between system outputs and references are far from satisfactory, and large language models (LLMs) such as ChatGPT have demonstrated great potential in NLG evaluation in recent years. Various automatic evaluation methods based on LLMs have been proposed, including metrics derived from LLMs, prompting LLMs, and fine-tuning LLMs with labeled evaluation data. In this survey, we first give a taxonomy of LLM-based NLG evaluation methods, and discuss their pros and cons, respectively. We also discuss human-LLM collaboration for NLG evaluation. Lastly, we discuss several open problems in this area and point out future research directions.
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Jungo Kasai,Keisuke Sakaguchi,Ronan Le Bras,Lavinia Dunagan,Jacob Morrison,Alexander R. Fabbri,Yejin Choi,Noah A. Smith

DOI: https://doi.org/10.48550/arXiv.2112.04139

2022-05-19

Abstract:Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlation with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future. Our project website is available at https://nlp.cs.washington.edu/billboard/.

Computation and Language
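
The abstract above mentions ranking metrics by their correlation with human judgments. As a rough illustration only, and not the Billboard implementation, the sketch below ranks a few metrics by Pearson correlation against human scores. The metric names, the scores, and the helper functions `pearson` and `rank_metrics` are all hypothetical and invented for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_metrics(metric_scores, human_scores):
    """Return metric names sorted by descending correlation with human scores."""
    corr = {name: pearson(vals, human_scores) for name, vals in metric_scores.items()}
    order = sorted(corr, key=lambda name: -corr[name])
    return order, corr


# Hypothetical per-output scores (illustrative values, not real data).
human = [1.0, 2.0, 3.0, 4.0]
metrics = {
    "metric_a": [2.0, 4.0, 6.0, 8.0],  # perfectly aligned with human scores
    "metric_b": [4.0, 3.0, 2.0, 1.0],  # perfectly inverted
    "metric_c": [1.0, 3.0, 2.0, 4.0],  # partially aligned
}

order, corr = rank_metrics(metrics, human)
print(order)
```

Pearson correlation is only one possible choice here; a rank correlation could be substituted without changing the overall structure.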
Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

Anas Himmi,Ekhine Irurozki,Nathan Noiry,Stephan Clemencon,Pierre Colombo

2023-05-17

Abstract:The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, several factors such as the cost of running baseline, private systems, computational limitations, or incomplete data may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research: benchmarking when some systems' scores are missing on the task, and proposes a novel approach to address it. Our method utilizes a compatible partial ranking approach to impute missing data, which is then aggregated using the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark, which contains over 131 million scores, an order of magnitude larger than existing benchmarks. We validate our methods and demonstrate their effectiveness in addressing the challenge of missing system evaluation on an entire task. This work highlights the need for more comprehensive benchmarking approaches that can handle real-world scenarios where not all systems are evaluated on the entire task.

Computation and Language,Artificial Intelligence
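
The abstract above refers to Borda count aggregation over rankings with missing systems. The sketch below is a naive baseline for intuition only: it averages normalized Borda points over the tasks where a system was actually ranked, and it does not implement the paper's compatible partial ranking imputation. The function name `naive_borda` and the example rankings are hypothetical.

```python
from collections import defaultdict


def naive_borda(rankings):
    """Aggregate per-task rankings (best first) that may omit some systems.

    Each task awards normalized Borda points in [0, 1]: 1 for the top system,
    0 for the bottom one. A system's score is its mean over the tasks in which
    it appears. This is a simplifying assumption, not the paper's method.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        if n < 2:
            continue  # a single-system task carries no ordering information
        for position, system in enumerate(ranking):
            totals[system] += (n - 1 - position) / (n - 1)
            counts[system] += 1
    scores = {system: totals[system] / counts[system] for system in totals}
    # Sort by descending score, breaking ties alphabetically.
    order = sorted(scores, key=lambda s: (-scores[s], s))
    return order, scores


# Hypothetical rankings: system C is missing from task 2, system B from task 3.
task_rankings = [["A", "B", "C"], ["B", "A"], ["A", "C"]]
order, scores = naive_borda(task_rankings)
print(order, scores)
```

In this toy input, B ends up ahead of A even though A wins two of three tasks, because B is never evaluated on the task where it might have lost. That sensitivity to which scores are missing is the kind of problem the abstract describes.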
Recent Advances in Neural Text Generation: A Task-Agnostic Survey

Chen Tang,Frank Guerin,Chenghua Lin

2023-06-12

Abstract:In recent years, considerable research has been dedicated to the application of neural models in the field of natural language generation (NLG). The primary objective is to generate text that is both linguistically natural and human-like, while also exerting control over the generation process. This paper offers a comprehensive and task-agnostic survey of the recent advancements in neural text generation. These advancements have been facilitated through a multitude of developments, which we categorize into four key areas: data construction, neural frameworks, training and inference strategies, and evaluation metrics. By examining these different aspects, we aim to provide a holistic overview of the progress made in the field. Furthermore, we explore the future directions for the advancement of neural text generation, which encompass the utilization of neural pipelines and the incorporation of background knowledge. These avenues present promising opportunities to further enhance the capabilities of NLG systems. Overall, this survey serves to consolidate the current state of the art in neural text generation and highlights potential avenues for future research and development in this dynamic field.

Computation and Language,Artificial Intelligence
NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

Iftitahu Ni'mah,Meng Fang,Vlado Menkovski,Mykola Pechenizkiy

DOI: https://doi.org/10.48550/arXiv.2305.08566

2023-05-26

Abstract:In this study, we analyze automatic evaluation metrics for Natural Language Generation (NLG), specifically task-agnostic metrics and human-aligned metrics. Task-agnostic metrics, such as Perplexity, BLEU, BERTScore, are cost-effective and highly adaptable to diverse NLG tasks, yet they have a weak correlation with human. Human-aligned metrics (CTC, CtrlEval, UniEval) improve correlation level by incorporating desirable human-like qualities as training objective. However, their effectiveness at discerning system-level performance and quality of system outputs remains unclear. We present metric preference checklist as a framework to assess the effectiveness of automatic metrics in three NLG tasks: Text Summarization, Dialogue Response Generation, and Controlled Generation. Our proposed framework provides access: (i) for verifying whether automatic metrics are faithful to human preference, regardless of their correlation level to human; and (ii) for inspecting the strengths and limitations of NLG systems via pairwise evaluation. We show that automatic metrics provide a better guidance than human on discriminating system-level performance in Text Summarization and Controlled Generation tasks. We also show that multi-aspect human-aligned metric (UniEval) is not necessarily dominant over single-aspect human-aligned metrics (CTC, CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly in Controlled Generation tasks.

Computation and Language
Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Andrea Sottana,Bin Liang,Kai Zou,Zheng Yuan

DOI: https://doi.org/10.18653/v1/2023.emnlp-main.543

2023-01-01

Abstract:Large Language Models (LLMs) evaluation is a patchy and inconsistent landscape, and it is becoming clear that the quality of automatic evaluation metrics is not keeping up with the pace of development of generative models. We aim to improve the understanding of current models' performance by providing a preliminary and hybrid evaluation on a range of open and closed-source generative LLMs on three NLP benchmarks: text summarisation, text simplification and grammatical error correction (GEC), using both automatic and human evaluation. We also explore the potential of the recently released GPT-4 to act as an evaluator. We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics, while scoring much more poorly when using classic automatic evaluation metrics. We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks. Finally, we find that GPT-4 is capable of ranking models' outputs in a way which aligns reasonably closely to human judgement despite task-specific variations, with a lower alignment in the GEC task.
Overview of the NLPCC-ICCPOL 2016 Shared Task: Sports News Generation from Live Webcast Scripts

Xiaojun Wan, Jianmin Zhang, Jin-ge Yao, Tianming Wang

DOI: https://doi.org/10.1007/978-3-319-50496-4_80

2016-01-01

Abstract:Live webcast scripts are valuable resources for describing the process of sports games. This shared task aims to automatically generate sports news articles from live webcast scripts. The task can be considered a special case of single document summarization. In this overview paper, we will introduce the task, the evaluation dataset, the participating teams and the evaluation results. The dataset has been released publicly.
Systematic Task Exploration with LLMs: A Study in Citation Text Generation

Furkan Şahinuç,Ilia Kuznetsov,Yufang Hou,Iryna Gurevych

2024-07-05

Abstract:Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric and has not yet been tackled within the LLM paradigm. Our results highlight the importance of systematically investigating both task instruction and input configuration when prompting LLMs, and reveal non-trivial relationships between different evaluation metrics used for citation text generation. Additional human generation and human evaluation experiments provide new qualitative insights into the task to guide future research in citation text generation. We make our code and data publicly available.

Computation and Language
