LLM Critics Help Catch LLM Bugs

Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike
2024-06-29
Abstract: Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation this work trains "critic" models that help humans to more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as "flawless", even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.
Software Engineering, Machine Learning
What problem does this paper attempt to address?
The paper addresses a fundamental limitation of Reinforcement Learning from Human Feedback (RLHF): human evaluators cannot always accurately assess the outputs of large language models (LLMs), especially on complex tasks such as writing code. As models become more capable, even experienced experts struggle to reliably judge the quality or correctness of their outputs. This limits the effectiveness of RLHF, and strongly optimizing against systematic human evaluation errors could lead to dangerous policies. To address this, the paper trains "critic" models that help humans evaluate model-written code more accurately. The critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code. On code containing naturally occurring LLM errors, their critiques are preferred over human critiques in 63% of cases, and they catch more bugs than human contractors paid for code review. The paper also studies human-machine teams of critics and contractors, finding that such teams write more comprehensive critiques than humans alone and catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone. Overall, the goal is to improve the accuracy and comprehensiveness of human evaluation of LLM outputs by pairing evaluators with trained critic models, thereby mitigating this limitation of RLHF.