Abstract: This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of "the Generative AI Paradox" (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.

What problem does this paper attempt to address?

This paper examines the gap between how large language models (LLMs) perform on generation tasks and how they perform on evaluation tasks. In particular, it asks whether a model that does well on a given generation task can also serve as a reliable evaluator for that same task. The authors focus on the "Generative AI Paradox": a generative model may perform well at generation yet poorly at evaluation.

#### Main problems:

1. **Inconsistency between generation and evaluation abilities**: The paper studies whether LLM performance differs significantly between generation and evaluation tasks. In experiments on three LLMs (GPT-3.5, GPT-4, PaLM-2) and one open-source language model (Vicuna-13b), the authors find that these models generally perform better in generation tasks than in evaluation tasks.
2. **Unfaithful evaluation**: The authors observe cases of "unfaithful evaluation", in which a model accurately evaluates answers in areas where it lacks competence itself. This calls the credibility and reliability of the model as an evaluator into question.
3. **Faithfulness in evaluation**: The paper emphasizes the faithfulness and trustworthiness of the model during evaluation. Specifically, the authors explore whether the model scores answers based on its actual knowledge and how it behaves on questions it is uncertain about.

#### Research background:

As automatic evaluation techniques have developed, using LLMs to evaluate free-text generation has become a low-cost and efficient option. However, current research shows that although LLMs perform well in generation tasks, their performance in evaluation tasks is not satisfactory. The relationship between generation and evaluation ability therefore needs closer study to ensure that LLM evaluation results are reliable and accurate.

#### Research methods:

- **Dataset**: The authors run their experiments on TriviaQA, an open-domain question-answering dataset.
- **Experimental setup**: The authors compare each model's performance on the generation task with its performance on the evaluation task and analyze the differences.
- **Evaluation metrics**: Evaluation accuracy and evaluation faithfulness.

#### Conclusions:

The results show that LLMs generally perform worse in evaluation tasks than in generation tasks, with evaluation accuracy especially low on low-quality answers. The models also show unfaithful behavior during evaluation: they sometimes give accurate evaluation results in areas where they are not competent. The authors therefore call for more in-depth research on the evaluation ability and faithfulness of LLMs to ensure their reliability and effectiveness in practical applications. In short, the paper reveals a complex relationship between generation and evaluation in generative AI models and proposes directions for future research, especially on the correlation between generation ability and evaluation ability.
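To make the comparison between generation and evaluation performance concrete, here is a minimal Python sketch of how such numbers might be computed from per-question records. The summary does not give the paper's exact metric definitions, so everything here is an assumption for illustration: the `Record` schema, the three function names, and the toy data are hypothetical, and `accurate_without_competence` is only one possible proxy for the "accurate evaluation where the model lacks competence" cases described above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One question seen from both sides (hypothetical schema, not the paper's)."""
    gen_correct: bool        # did the model answer the question correctly itself?
    candidate_correct: bool  # is the candidate answer shown to the evaluator correct?
    judged_correct: bool     # did the model, acting as evaluator, call it correct?


def generation_accuracy(records):
    """Share of questions the model answered correctly on its own."""
    return sum(r.gen_correct for r in records) / len(records)


def evaluation_accuracy(records):
    """Share of candidate answers whose correctness the model judged accurately."""
    return sum(r.judged_correct == r.candidate_correct for r in records) / len(records)


def accurate_without_competence(records):
    """Among questions the model got wrong itself, share it still judged accurately."""
    missed = [r for r in records if not r.gen_correct]
    if not missed:
        return 0.0
    return sum(r.judged_correct == r.candidate_correct for r in missed) / len(missed)


# Toy data, invented purely to exercise the functions.
records = [
    Record(gen_correct=True, candidate_correct=True, judged_correct=True),
    Record(gen_correct=True, candidate_correct=False, judged_correct=True),
    Record(gen_correct=True, candidate_correct=True, judged_correct=False),
    Record(gen_correct=False, candidate_correct=False, judged_correct=False),
]

print(generation_accuracy(records))          # 0.75
print(evaluation_accuracy(records))          # 0.5
print(accurate_without_competence(records))  # 1.0
```

In this toy sample the model answers three of four questions correctly but judges only two of four candidates accurately, and the one question it could not answer is still judged accurately. That mirrors the pattern the summary describes, without reproducing any figure from the paper.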

The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Self-Evaluation Improves Selective Generation in Large Language Models

The Generative AI Paradox: "What It Can Create, It May Not Understand"

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Can Large Language Models Be an Alternative to Human Evaluations?

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models

Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality

The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

Transforming Assessment: The Impacts and Implications of Large Language Models and Generative AI

The Origins of Generative AI in Transcription and Machine Translation, and Why That Matters

The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML

Retrieving Supporting Evidence for Generative Question Answering