LLM Critics Help Catch LLM Bugs

Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike

2024-06-29

Abstract: Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.

Software Engineering, Machine Learning

What problem does this paper attempt to address?

The paper addresses a fundamental limitation of reinforcement learning from human feedback (RLHF): human evaluators cannot always assess the outputs of large language models (LLMs) accurately, especially on complex tasks such as writing code. As models become more capable, even experienced experts struggle to judge reliably whether an output is correct. This limits the effectiveness of RLHF and, if systematic human evaluation errors are strongly optimized, could lead to dangerous strategies. To address this, the paper trains "critic" models, themselves LLMs trained with RLHF, that write natural language feedback pointing out problems in model-written code. On code containing naturally occurring LLM errors, these model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that the critics catch more bugs than human contractors paid for code review. The paper also studies human-machine teams of critics and contractors, finding that such teams catch a similar number of bugs to LLM critics while hallucinating less than LLMs alone. Overall, the goal is to make human evaluation of LLM outputs more accurate and comprehensive with the help of trained critic models, and so to ease this limitation of RLHF.
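As a rough illustration of what a figure like "preferred in 63% of cases" means, the sketch below computes a pairwise preference rate from comparison labels. It is not the paper's evaluation code: the function name preference_rate, the label strings, the decision to exclude ties, and the sample labels are all assumptions made for this example.

```python
from collections import Counter


def preference_rate(labels, target="model"):
    """Fraction of decided pairwise comparisons won by `target`.

    Assumption: each label names the preferred critique ("model" or "human"),
    and "tie" comparisons are excluded from the denominator.
    """
    counts = Counter(labels)
    decided = sum(n for label, n in counts.items() if label != "tie")
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[target] / decided


# Hypothetical labels, invented for illustration only (not data from the paper).
labels = ["model", "human", "model", "tie", "model", "human", "model", "model"]
print(f"{preference_rate(labels):.2f}")  # 5 of 7 decided comparisons -> 0.71
```

Whether ties are dropped or split between the two sides changes the reported rate, so a real evaluation would need to state that choice explicitly.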

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Evaluating LLMs at Detecting Errors in LLM Responses

CriticAL: Critic Automation with Language Models

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

CriticEval: Evaluating Large Language Model as Critic

Critique Ability of Large Language Models

A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics

LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

AI-powered Code Review with LLMs: Early Results

Easy Problems That LLMs Get Wrong

CritiqueLLM: Scaling LLM-as-Critic for Effective and Explainable Evaluation of Large Language Model Generation

LLMs are Imperfect, Then What? An Empirical Study on LLM Failures in Software Engineering

Evaluating Language Models for Generating and Judging Programming Feedback

Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback

Are You Human? An Adversarial Benchmark to Expose LLMs

Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests