Abstract: We present TIGERScore, a \textbf{T}rained metric that follows \textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable and \textbf{R}eference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instruction to provide error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and trained on our meticulously curated instruction-tuning dataset MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output $\rightarrow$ error analysis). We collected the `system outputs' from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and almost approaches the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass that of the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conduct a human evaluation of the generated explanations and find that the explanations are 70.8\% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task. All resources are released on our project website: \url{https://tiger-ai-lab.github.io/TIGERScore/}.
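
To make the data format and the quantitative assessment described above concrete, the following is a minimal sketch, not the authors' implementation. The class `MetricInstructExample`, its field names, and the helper functions are hypothetical, and Spearman rank correlation is an assumed choice, since the abstract only says "correlation with human ratings".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricInstructExample:
    """One (instruction, input, system output -> error analysis) quadruple.

    Field names are hypothetical; the released dataset may use others.
    """
    instruction: str
    input_text: str
    system_output: str
    error_analysis: str  # the target the metric is trained to generate


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(metric_scores: List[float], human_ratings: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(metric_scores), average_ranks(human_ratings)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Given per-example metric scores and human ratings for one dataset, `spearman(metric_scores, human_ratings)` returns a value in [-1, 1], where a higher value means the metric orders the outputs more like the human raters do.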

INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback

TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

Towards Explainable Evaluation Metrics for Machine Translation

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark

GRUEN for Evaluating Linguistic Quality of Generated Text

DEE: Dual-stage Explainable Evaluation Method for Text Generation

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

BARTScore: Evaluating Generated Text as Text Generation

DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs

A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models

RepEval: Effective Text Evaluation with LLM Representation

QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection