CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang

2024-06-01

Abstract: The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of evaluating how well large language models (LLMs) can critique and correct reasoning. It introduces CriticBench, a benchmark that assesses the critique and correction abilities of LLMs across a variety of tasks. CriticBench covers five reasoning domains (mathematical, commonsense, symbolic, coding, and algorithmic), compiles 15 datasets, and incorporates responses from three LLM families. Using this benchmark, the authors evaluate 17 LLMs on generation, critique, and correction reasoning (GQC reasoning) and analyze the factors that affect critique-correct reasoning. The main findings are: (1) generation, critique, and correction abilities are linearly related, and critique-focused training markedly improves performance; (2) correction effectiveness varies by task, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies decrease as model size increases; and (4) stronger models are better at critiquing weaker ones, while weaker models can sometimes surpass stronger ones in self-critique. The authors hope these insights will foster further research on LLM critique and self-improvement.

Critique Ability of Large Language Models

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

CriticEval: Evaluating Large Language Model as Critic

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic

Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying

Are Your LLMs Capable of Stable Reasoning?

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

LLMs for Relational Reasoning: How Far are We?

LawBench: Benchmarking Legal Knowledge of Large Language Models

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

A NotSo Simple Way to Beat Simple Bench

LLMs cannot find reasoning errors, but can correct them given the error location