The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh
2024-02-09
Abstract: This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of "the Generative AI Paradox" (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines the gap between how well large language models (LLMs) perform on generation tasks and how well they perform on evaluation tasks. In particular, it asks whether a model that does well on a given generation task can serve as a reliable evaluator for that same task. The authors focus on the "Generative AI Paradox": generative AI models that perform well at generation may still perform poorly at evaluation.

#### Main problems:

1. **Inconsistency between generation and evaluation abilities**: The paper studies whether LLM performance differs significantly between generation tasks and evaluation tasks. In experiments with three LLMs (GPT-3.5, GPT-4, PaLM-2) and one open-source language model (Vicuna-13b), the authors find that these models generally perform better at generation than at evaluation.
2. **Unfaithful evaluation**: The authors find cases of "unfaithful evaluation", in which a model evaluates answers accurately in areas where it lacks competence itself. This indicates that the credibility and reliability of models as evaluators need further examination.
3. **Faithfulness in evaluation**: The paper emphasizes the faithfulness and trustworthiness of the model during evaluation. Specifically, the authors explore whether a model scores answers based on its actual knowledge and how it behaves on questions it is uncertain about.

#### Research background:

With the development of automatic evaluation techniques, using LLMs to evaluate free-text generation tasks has become a low-cost and efficient option. However, current research shows that although LLMs perform well in generation tasks, their performance in evaluation tasks is not satisfactory. It is therefore necessary to examine how generation and evaluation abilities relate in LLMs, so that their evaluation results can be trusted.

#### Research methods:

- **Dataset**: The authors run their experiments on TriviaQA, an open-domain question-answering dataset.
- **Experimental setup**: The authors compare each model's performance on the generation task with its performance on the evaluation task and analyze the differences.
- **Evaluation metrics**: Evaluation accuracy and evaluation faithfulness.

#### Conclusions:

The results show that LLMs generally perform worse in evaluation tasks than in generation tasks, and that evaluation accuracy is especially low when the answers being judged are of low quality. The models also behave unfaithfully as evaluators: they sometimes give accurate evaluations in areas where they are not competent themselves. The authors therefore call for more in-depth research on the evaluation ability and faithfulness of LLMs to ensure their reliability and effectiveness in practical applications.

In short, this paper reveals the complex relationship between generation and evaluation in generative AI models and proposes directions for future research, especially on the correlation between generation ability and evaluation ability.
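
The comparison described under Research methods can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' code. It assumes exact-match scoring against a list of accepted answer aliases and binary correct/incorrect verdicts from the model; the paper's own scoring procedure and its definition of faithfulness may differ. The `Record` type, the function names, and the toy records are all hypothetical.

```python
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_correct(answer: str, gold_aliases: list[str]) -> bool:
    """Assumed scoring rule: exact match against any accepted alias."""
    return normalize(answer) in {normalize(alias) for alias in gold_aliases}


@dataclass
class Record:
    gold_aliases: list[str]   # accepted reference answers for the question
    model_answer: str         # the model's own answer (generation task)
    candidate_answer: str     # an answer shown to the model for judging
    model_verdict: bool       # True if the model judged the candidate correct


def verdict_is_right(r: Record) -> bool:
    """The verdict is right when it agrees with the gold label of the candidate."""
    return r.model_verdict == is_correct(r.candidate_answer, r.gold_aliases)


def generation_accuracy(records: list[Record]) -> float:
    """Fraction of questions the model answers correctly itself."""
    return sum(is_correct(r.model_answer, r.gold_aliases) for r in records) / len(records)


def evaluation_accuracy(records: list[Record]) -> float:
    """Fraction of candidate answers the model judges correctly."""
    return sum(verdict_is_right(r) for r in records) / len(records)


def correct_verdicts_without_knowledge(records: list[Record]) -> int:
    """Count right verdicts on questions the model itself answered wrongly.

    This is one possible proxy for the unfaithful evaluation the summary
    describes, not the paper's metric.
    """
    return sum(
        1 for r in records
        if not is_correct(r.model_answer, r.gold_aliases) and verdict_is_right(r)
    )


# Toy, hand-written records (not taken from TriviaQA or from the paper).
RECORDS = [
    Record(["Paris"], "Paris", "Paris", True),
    Record(["Mount Everest", "Everest"], "everest", "K2", True),
    Record(["Canberra"], "Sydney", "Canberra", True),
    Record(["1969"], "1969", "1968", True),
]

if __name__ == "__main__":
    print("generation accuracy:", generation_accuracy(RECORDS))
    print("evaluation accuracy:", evaluation_accuracy(RECORDS))
    print("right verdicts without knowledge:", correct_verdicts_without_knowledge(RECORDS))
```

On this toy data the generation accuracy is 0.75 and the evaluation accuracy is 0.5, which mirrors the direction of the reported gap without reproducing any of the paper's numbers. The third record is the kind of case the summary flags: the model answers the question wrongly yet judges the candidate answer correctly.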