PRE: A Peer Review Based Large Language Model Evaluator

Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu

2024-06-03

Abstract: The impressive performance of large language models (LLMs) has attracted considerable attention from the academic and industrial communities. Besides how to construct and train LLMs, how to effectively evaluate and compare the capacity of LLMs has also been well recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which make them incapable of supporting the sustainable development of LLMs in the long term. To address these issues, and inspired by the peer review systems widely used in the academic publication process, we propose a novel framework that automatically evaluates LLMs through a peer-review process. Specifically, for the evaluation of a specific task, we first construct a small qualification exam to select "reviewers" from a couple of powerful LLMs. Then, to actually evaluate the "submissions" written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of evaluatee LLMs is generated based on the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs including GPT-4. The results demonstrate the existence of bias when evaluation relies on a single LLM. Also, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer review mechanism.
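The three stages described in the abstract (qualification exam, reviewing, result fusion) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pass threshold, the use of exam accuracy as a fusion weight, the function names, and all scores below are assumptions made up for the example.

```python
def select_reviewers(exam_accuracy, threshold=0.6):
    """Keep candidate reviewers whose qualification-exam accuracy
    meets the (hypothetical) pass threshold."""
    return {name: acc for name, acc in exam_accuracy.items() if acc >= threshold}


def fuse_ratings(ratings, weights):
    """Fuse reviewer ratings into one score per evaluatee.

    ratings: {reviewer: {evaluatee: score}}
    weights: {reviewer: weight}; only reviewers listed here are used.
    Returns the weighted mean score per evaluatee (an assumed fusion rule).
    """
    total_weight = sum(weights.values())
    evaluatees = {e for reviewer in weights for e in ratings[reviewer]}
    return {
        e: sum(weights[r] * ratings[r][e] for r in weights) / total_weight
        for e in evaluatees
    }


def rank_evaluatees(fused_scores):
    """Order evaluatees from best to worst by fused score."""
    return sorted(fused_scores, key=fused_scores.get, reverse=True)


# Toy data, invented for illustration only.
exam_accuracy = {"reviewer_a": 0.9, "reviewer_b": 0.7, "reviewer_c": 0.4}
ratings = {
    "reviewer_a": {"model_x": 4, "model_y": 2, "model_z": 3},
    "reviewer_b": {"model_x": 5, "model_y": 3, "model_z": 2},
    "reviewer_c": {"model_x": 1, "model_y": 5, "model_z": 1},
}

qualified = select_reviewers(exam_accuracy)   # reviewer_c fails the exam
fused = fuse_ratings(ratings, qualified)      # weight = exam accuracy
ranking = rank_evaluatees(fused)
print(ranking)
```

In this toy run the reviewer that fails the exam contributes nothing to the final ranking, which is the intended effect of the qualification step: an unreliable single judge cannot dominate the result.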

Information Retrieval, Computation and Language

What problem does this paper attempt to address?

The paper addresses the effectiveness and efficiency of performance evaluation for large language models (LLMs). With the rapid development of LLMs in academia and industry, assessing their capabilities reliably and economically has become a key bottleneck limiting their progress. Existing evaluation paradigms, whether based on human annotation or on model-based assessment, suffer from high cost, low generalizability, and inherited biases. To tackle these issues, the paper proposes the Peer Review Evaluator (PRE), a framework inspired by the peer review system in academic publishing that automatically evaluates LLMs through a peer-review process. The main contributions are as follows:

1. A new automatic LLM evaluation framework is proposed, which incorporates the peer review mechanism to assess the performance of LLMs directly.
2. Through qualification exams and result fusion, PRE can largely avoid the model biases commonly present in existing automatic evaluation methods, achieving effective LLM evaluation.
3. The paper validates the potential of the PRE framework through extensive experiments, including document summarization and non-factual question-answering tasks, showing that PRE's results are more consistent with human preferences (i.e., the ground truth) than those of the baseline models.

In summary, the PRE framework aims to overcome the limitations of current LLM evaluation methods by introducing the concept of peer review, providing a more reliable, generalizable, and cost-effective evaluation solution.
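To make "consistency with human preferences" concrete, one common way to compare an evaluator's ranking with a human ranking is a rank correlation such as Kendall's tau. The summary above does not say which measure the paper uses, so the metric choice, the function name, and the rankings below are assumptions for illustration only.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two strict rankings of the same items.

    Each ranking is a list ordered from best to worst, with no ties
    and at least two items. Returns a value in [-1, 1]; 1 means the
    two rankings agree on every pair.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings of four evaluatee models.
human_ranking = ["model_x", "model_z", "model_y", "model_w"]
evaluator_ranking = ["model_x", "model_y", "model_z", "model_w"]

tau = kendall_tau(human_ranking, evaluator_ranking)
print(round(tau, 3))
```

Here the two rankings disagree on one of six pairs, so tau is (5 - 1) / 6. An evaluator whose tau against human judgments is higher than that of the baselines would be called more consistent with human preferences in this sense.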

Automatic Large Language Model Evaluation Via Peer Review

An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Peer-review-in-LLMs: Automatic Evaluation Method for LLMs in Open-environment.

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations

CriticEval: Evaluating Large Language Model as Critic

Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Large Language Models are not Fair Evaluators

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

A Survey on Evaluation of Large Language Models

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

PiCO: Peer Review in LLMs based on the Consistency Optimization

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Critique Ability of Large Language Models

Can Large Language Models Serve as Evaluators for Code Summarization?

Style Over Substance: Evaluation Biases for Large Language Models

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension