Abstract: The emergence of ChatGPT has recently attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric itself is still underexplored. Because assessing the quality of natural language generation (NLG) models is an arduous task and existing NLG metrics notoriously correlate poorly with human judgments, we ask whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation of ChatGPT to assess its reliability as an NLG metric. Specifically, we treat ChatGPT as a human evaluator and give it task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instructions that prompt it to evaluate the outputs of NLG models. We conduct experiments on five NLG meta-evaluation datasets (covering summarization, story generation and data-to-text tasks). Experimental results show that, compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator might be influenced by how the meta-evaluation datasets were created. For meta-evaluation datasets whose creation depends heavily on the reference, and which are thus biased, the ChatGPT evaluator might lose its effectiveness. We hope our preliminary study can prompt the emergence of a general-purpose, reliable NLG metric.
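The meta-evaluation setup described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical prompt template (`build_prompt`, whose wording is not taken from the paper) that combines a task-specific and an aspect-specific instruction, and it uses Spearman rank correlation as one plausible way to compare evaluator scores with human judgments. The scores in the usage note are made-up illustration values, not experimental results.

```python
from statistics import mean


def build_prompt(task, aspect, source, output):
    """Hypothetical template: task- and aspect-specific scoring instruction."""
    return (
        f"Score the following {task} output with respect to {aspect} "
        "on a scale from 1 (worst) to 5 (best).\n\n"
        f"Source:\n{source}\n\n"
        f"Generated output:\n{output}\n\n"
        "Score:"
    )


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

As a usage example, `spearman(evaluator_scores, human_scores)` over the same set of generated outputs gives a single number per aspect; a value closer to 1 would indicate that the evaluator ranks outputs more like the human annotators do. For instance, the made-up lists `[4, 2, 5, 3]` and `[5, 1, 4, 3]` can be passed directly.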

CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains

Evaluating the generation capabilities of large Chinese language models

CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation

Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction

CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation

Is ChatGPT a Good NLG Evaluator? A Preliminary Study

ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

An Extensive Benchmark Study on Biomedical Text Generation and Mining with ChatGPT

GPTEval: A Survey on Assessments of ChatGPT and GPT-4

GTM: A Generative Triple-Wise Model for Conversational Question Generation

GLGE: A New General Language Generation Evaluation Benchmark

Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time

Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization

A Survey on the Real Power of ChatGPT

How Generative-AI can be Effectively used in Government Chatbots

A New Evaluation Method: Evaluation Data and Metrics for Chinese Grammar Error Correction

ChatQA: Surpassing GPT-4 on Conversational QA and RAG