V-STaR: Training Verifiers for Self-Taught Reasoners

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal
2024-08-14
Abstract: Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses a limitation of existing self-improvement methods for large language models (LLMs), such as STaR: they fine-tune only on correct self-generated solutions and discard the many incorrect ones, which may contain valuable information. V-STaR instead uses both the correct and the incorrect solutions produced during self-improvement to train a verifier with DPO that judges the correctness of model-generated solutions, so the system also learns from the model's own mistakes. At inference time, the verifier selects one solution from many candidates. Running V-STaR for multiple iterations yields progressively better reasoners and verifiers; the paper reports a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
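The two verifier-related steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that verifier training data is formed by pairing each correct solution with each incorrect solution for the same problem, and that the trained verifier is available as a scoring function. The names `build_preference_pairs`, `select_best`, `is_correct`, and `score`, and the `prompt`/`chosen`/`rejected` record layout, are hypothetical choices made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List


def build_preference_pairs(
    problem: str,
    solutions: List[str],
    is_correct: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Pair every correct solution with every incorrect one (assumed pairing scheme)."""
    correct = [s for s in solutions if is_correct(s)]
    incorrect = [s for s in solutions if not is_correct(s)]
    # Each record prefers a correct solution over an incorrect one for the same problem.
    return [
        {"prompt": problem, "chosen": c, "rejected": r}
        for c, r in product(correct, incorrect)
    ]


def select_best(
    problem: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("at least one candidate solution is required")
    return max(candidates, key=lambda s: score(problem, s))


# Toy data: a stand-in correctness check and a stand-in verifier score.
problem = "What is 2 + 2?"
solutions = ["2 + 2 = 4", "2 + 2 = 5", "2 + 2 = 22"]


def ends_with_four(solution: str) -> bool:
    return solution.endswith("= 4")


def toy_score(prob: str, solution: str) -> float:
    return 1.0 if ends_with_four(solution) else 0.0


pairs = build_preference_pairs(problem, solutions, ends_with_four)
best = select_best(problem, solutions, toy_score)
```

In a real setup, `is_correct` would compare against ground-truth answers or run test cases, and `score` would come from the DPO-trained verifier rather than a hand-written rule.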