CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang

2024-06-26

Abstract: Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses a problem in natural language processing (NLP): how to generate informative critiques for assessing the quality of text produced by large language models (LLMs). Existing critique generation models, mostly trained on evaluation data labeled by directly prompting GPT-4, struggle to produce detailed and discriminative critiques in both pointwise grading and pairwise comparison, especially when no reference text is available. Their critiques therefore lack the fine-grained information needed to distinguish the quality of different generated texts, which hurts evaluation performance. To address this, the authors propose a simple yet effective method called Eval-Instruct. It first acquires pointwise grading critiques with pseudo references and then revises them through multi-path prompting, yielding informative evaluation data for different tasks and settings: pointwise grading and pairwise comparison, each with and without references. Fine-tuning on this data produces the model CritiqueLLM. Experiments show that CritiqueLLM outperforms ChatGPT and all the open-source baselines, and achieves evaluation performance comparable to GPT-4 in system-level correlations of pointwise grading. The critiques generated by CritiqueLLM can also serve as scalable feedback to further improve the generation quality of strong LLMs such as ChatGPT.
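As a rough illustration of what a system-level correlation in pointwise grading can mean, the sketch below averages a critic's scores and human scores per system and then correlates the two lists of averages. This follows a common definition and is an assumption here: the text above does not say which correlation coefficient or aggregation the paper uses. The function names, record layout, and scores are all hypothetical and made up for the example.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_level_correlation(records):
    """Correlate per-system mean critic scores with per-system mean human scores.

    records: iterable of (system_name, critic_score, human_score) tuples.
    This layout is an assumption for illustration, not the paper's data format.
    """
    critic = defaultdict(list)
    human = defaultdict(list)
    for system, critic_score, human_score in records:
        critic[system].append(critic_score)
        human[system].append(human_score)
    systems = sorted(critic)
    critic_means = [sum(critic[s]) / len(critic[s]) for s in systems]
    human_means = [sum(human[s]) / len(human[s]) for s in systems]
    return pearson(critic_means, human_means)


# Made-up pointwise scores for three hypothetical systems.
records = [
    ("system_a", 8, 9), ("system_a", 6, 7),
    ("system_b", 5, 4), ("system_b", 3, 4),
    ("system_c", 2, 1), ("system_c", 4, 3),
]
print(system_level_correlation(records))
```

A value close to 1 would indicate that the critic ranks whole systems much as human annotators do, even if individual scores disagree. A rank-based coefficient such as Spearman or Kendall could be swapped in for `pearson` without changing the aggregation step.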

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

CriticEval: Evaluating Large Language Model as Critic

Training Language Models to Critique With Multi-agent Feedback

Critique Ability of Large Language Models

Large Language Models Are Active Critics in NLG Evaluation

The Critique of Critique

PRE: A Peer Review Based Large Language Model Evaluator

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Automatic Large Language Model Evaluation Via Peer Review

Self-Generated Critiques Boost Reward Modeling for Language Models

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

A Closer Look into Using Large Language Models for Automatic Evaluation

Eliciting Informative Text Evaluations with Large Language Models

Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems

Evaluating Large Language Models in Class-Level Code Generation

Review-LLM: Harnessing Large Language Models for Personalized Review Generation

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Self-critiquing models for assisting human evaluators

On the Evaluation of Large Language Models in Unit Test Generation