CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang
2024-06-01
Abstract: The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of evaluating how well large language models (LLMs) can critique and correct reasoning. It introduces CriticBench, a benchmark that assesses the critique and correction abilities of LLMs across a variety of tasks. CriticBench covers five reasoning domains (mathematical, commonsense, symbolic, coding, and algorithmic), compiles 15 datasets, and incorporates responses from three LLM families. Using this benchmark, the authors evaluate 17 LLMs on generation, critique, and correction reasoning (GQC reasoning) and analyze the key factors that affect critique-correct reasoning. The main findings are: (1) generation, critique, and correction abilities are linearly related, and critique-focused training markedly improves performance; (2) the effectiveness of correction varies by task, with logic-oriented tasks being more amenable to correction; (3) inconsistencies in GQC knowledge decrease as model size increases; and (4) stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in self-critique. The authors hope these insights will foster further research on LLM critique and self-improvement.