VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E. Byrd, Robert Zinkov, Nada Amin
2024-05-24
Abstract: Large Language Models (LLMs) can generate useful code, but often the code they generate cannot be trusted to be sound. In this paper, we present VerMCTS, an approach to begin to resolve this issue by generating verified programs in Dafny and Coq. VerMCTS uses a logical verifier in concert with an LLM to guide a modified Monte Carlo Tree Search (MCTS). This approach leverages the verifier to gain intermediate feedback inside the search algorithm by checking partial programs at each step to estimate an upper bound on the value function. To measure the performance of VerMCTS, we develop a new suite of multi-step verified programming problems in Dafny and Coq. In terms of pass@T, a new metric which computes the pass rate given a budget of T tokens sampled from the LLM, VerMCTS leads to more than a 30% absolute increase in average pass@5000 across the suite over repeated sampling from the base language model. Our code and benchmarks are available at
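The idea of checking partial programs to bound the value function can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_verifier`, `Verdict`, `value_upper_bound`, and `prune` are hypothetical names, the toy verifier merely stands in for Dafny or Coq, and the sketch assumes that a partial program which already fails verification cannot be repaired by appending more code.

```python
from enum import Enum
from typing import List


class Verdict(Enum):
    PASS = "pass"        # the partial program verifies so far
    FAIL = "fail"        # the verifier found an error
    UNKNOWN = "unknown"  # not enough code to check yet


def toy_verifier(steps: List[str]) -> Verdict:
    """Hypothetical stand-in for Dafny/Coq: any step containing 'bug' fails."""
    if not steps:
        return Verdict.UNKNOWN
    if any("bug" in step for step in steps):
        return Verdict.FAIL
    return Verdict.PASS


def value_upper_bound(verdict: Verdict) -> float:
    """Upper bound on the chance that some completion verifies.

    Assumption: a failing prefix cannot be fixed by appending code, so its
    bound is 0; otherwise nothing is ruled out and the bound stays at 1.
    """
    return 0.0 if verdict is Verdict.FAIL else 1.0


def prune(candidates: List[List[str]]) -> List[List[str]]:
    """Keep only candidate prefixes whose upper bound is still positive."""
    return [c for c in candidates if value_upper_bound(toy_verifier(c)) > 0.0]


candidates = [
    ["define max", "prove lemma"],   # passes the toy verifier
    ["define max", "bug in lemma"],  # fails, so it is pruned
    [],                              # nothing to check yet, so it is kept
]
kept = prune(candidates)
print(kept)
```

In a tree search, a bound of 0 lets the search discard a branch without spending further LLM tokens on it, which is the kind of intermediate feedback the abstract describes.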
Software Engineering, Artificial Intelligence, Machine Learning, Logic in Computer Science, Programming Languages
What problem does this paper attempt to address?
This paper addresses how to make code generated by large language models (LLMs) verifiably correct, so that users do not have to check and regenerate it themselves. It introduces the Verifier Monte Carlo Tree Search (VerMCTS) algorithm, which combines a logical verifier, an LLM, and a modified Monte Carlo Tree Search to synthesize verified programs in the verification-aware languages Dafny and Coq. Inside the search, VerMCTS runs the verifier on the partial program at each step and uses the result as intermediate feedback to estimate an upper bound on the value function. To evaluate the approach, the paper develops a new suite of multi-step verified programming problems in Dafny and Coq. It also proposes a new metric, pass@T, which measures the pass rate given a budget of T tokens sampled from the LLM. On this suite, VerMCTS yields more than a 30% absolute increase in average pass@5000 over repeated sampling from the base language model. The paper further compares VerMCTS with several baselines, including direct sampling, MCTS with rollouts, and prompting techniques that feed verifier error messages back to the model. VerMCTS performs best on most problems, particularly on the more complex verification tasks.
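As a rough illustration of pass@T, the sketch below computes an empirical pass rate from a set of independent runs. It is an assumption about how such a metric could be estimated, not the paper's estimator: `pass_at_t` is a hypothetical helper, and the run data is made up for the example. Each entry records how many tokens a run had sampled when it first produced a verified program, or `None` if it never did.

```python
from typing import Optional, Sequence


def pass_at_t(tokens_to_success: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of runs that reached a verified program within `budget` tokens."""
    if not tokens_to_success:
        raise ValueError("need at least one run")
    solved = sum(
        1 for tokens in tokens_to_success
        if tokens is not None and tokens <= budget
    )
    return solved / len(tokens_to_success)


# Illustrative data only: tokens used at first success, None if the run failed.
runs = [1200, None, 4800, 7500, 300]
rate = pass_at_t(runs, budget=5000)
print(f"pass@5000 = {rate:.2f}")
```

Measuring the budget in tokens rather than in samples makes methods comparable even when they spend very different amounts of LLM output per attempt, for example a tree search against repeated whole-program sampling.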