Abstract: Critical thinking is essential for rational decision-making and problem-solving. This skill hinges on the ability to provide precise and reasoned critiques and is a hallmark of human intelligence. In the era of large language models (LLMs), this study explores the ability of LLMs to deliver accurate critiques across various tasks. The topic matters because a capable critic model could serve not only as a reliable evaluator but also as a source of supervision signals for model tuning. In particular, a model that can self-critique has the potential for autonomous self-improvement. To examine this, we introduce a unified evaluation framework for assessing the critique abilities of LLMs. We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses, each annotated for correctness. The benchmark covers tasks such as math problem-solving, code completion, and question answering. We evaluate multiple LLMs on the collected dataset, and our analysis reveals several noteworthy insights: (1) Critique is generally challenging for most LLMs, and this capability often emerges only when models are sufficiently large. (2) Self-critique is especially difficult: even top-performing LLMs struggle to achieve satisfactory performance. (3) Models tend to have lower critique accuracy on problems where they are most uncertain. Motivated by these findings, we introduce a simple yet effective baseline named self-check, which leverages self-critique to improve task performance for various models. We hope this study serves as an initial exploration into understanding the critique abilities of LLMs and informs future research, including the development of more proficient critic models and the application of critiques across diverse tasks.

Model Criticism for Long-Form Text Generation

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation

Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

RSTGen: Imbuing Fine-Grained Interpretable Control into Long-Form Text Generators

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text

Evaluating Computational Language Models with Scaling Properties of Natural Language

Neural Net Models for Open-Domain Discourse Coherence

Finding Structure in Language Models

Assessing Language Models' Worldview for Fiction Generation

Transformer Models for Text Coherence Assessment

CriticAL: Critic Automation with Language Models

Language Model Behavior: A Comprehensive Survey

Collective Critics for Creative Story Generation

Language Model Evaluation Beyond Perplexity

Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic

The Next Chapter: A Study of Large Language Models in Storytelling

CriticEval: Evaluating Large Language Model as Critic

Self-critiquing models for assisting human evaluators

Critique Ability of Large Language Models