VerMCTS: Synthesizing Multi-Step Programs using a Verifier, a Large Language Model, and Tree Search

David Brandfonbrener, Simon Henniger, Sibi Raja, Tarun Prasad, Chloe Loughridge, Federico Cassano, Sabrina Ruixin Hu, Jianang Yang, William E. Byrd, Robert Zinkov, Nada Amin

2024-05-24

Abstract: Large Language Models (LLMs) can generate useful code, but often the code they generate cannot be trusted to be sound. In this paper, we present VerMCTS, an approach to begin to resolve this issue by generating verified programs in Dafny and Coq. VerMCTS uses a logical verifier in concert with an LLM to guide a modified Monte Carlo Tree Search (MCTS). This approach leverages the verifier to gain intermediate feedback inside the search algorithm by checking partial programs at each step to estimate an upper bound on the value function. To measure the performance of VerMCTS, we develop a new suite of multi-step verified programming problems in Dafny and Coq. In terms of pass@T, a new metric which computes the pass rate given a budget of T tokens sampled from the LLM, VerMCTS leads to more than a 30% absolute increase in average pass@5000 across the suite over repeated sampling from the base language model. Our code and benchmarks are available at
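The idea of checking partial programs to bound the value function can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `check_prefix` is a stand-in verifier for balanced-parenthesis strings (not Dafny or Coq), `propose` stands in for the LLM, and the loop is a plain depth-first search rather than the paper's modified MCTS. The point is only that a partial program the verifier rejects has a value upper bound of 0 and can be pruned without further sampling.

```python
from typing import Callable, List, Optional


def check_prefix(prefix: str) -> Optional[bool]:
    """Toy stand-in for a verifier (hypothetical, not Dafny/Coq).

    Returns False if no completion of the prefix can be valid,
    True if the prefix is already a complete valid program,
    and None if the verifier cannot decide yet.
    """
    depth = 0
    for ch in prefix:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False  # a ")" with no matching "(" can never be repaired
    return True if prefix and depth == 0 else None


def value_upper_bound(prefix: str) -> float:
    """Upper bound on the success probability reachable from this prefix."""
    return 0.0 if check_prefix(prefix) is False else 1.0


def search(propose: Callable[[str], List[str]], max_len: int) -> Optional[str]:
    """Depth-first search that prunes prefixes whose upper bound is 0."""
    frontier = [""]
    while frontier:
        prefix = frontier.pop()
        for step in propose(prefix):
            candidate = prefix + step
            if len(candidate) > max_len:
                continue
            if value_upper_bound(candidate) == 0.0:
                continue  # verifier rejected the partial program: prune
            if check_prefix(candidate) is True:
                return candidate
            frontier.append(candidate)
    return None


# Hypothetical "LLM" that always proposes the same two continuations.
result = search(lambda prefix: [")", "("], max_len=4)
```

In this toy setting the rejected prefix `")"` is discarded immediately, so no further proposals are spent extending it.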

Software Engineering, Artificial Intelligence, Machine Learning, Logic in Computer Science, Programming Languages

What problem does this paper attempt to address?

This paper addresses how to make code generated by large language models (LLMs) verifiably correct, which reduces the burden on users of checking and regenerating code. It introduces VerMCTS, an algorithm that combines a logical verifier, an LLM, and a modified Monte Carlo Tree Search (MCTS) to synthesize verified programs in verification-aware languages such as Dafny and Coq. Inside the search, the verifier checks partial programs at each step, and this intermediate feedback is used to estimate an upper bound on the value function. The paper also develops a new suite of multi-step verified programming problems to evaluate VerMCTS, and proposes a new metric, pass@T, which computes the pass rate given a budget of T tokens sampled from the LLM. Compared with repeated sampling from the base language model, VerMCTS yields more than a 30% absolute increase in average pass@5000 across the suite. It is also compared against other baselines, including MCTS with rollouts and prompting techniques that feed verifier error messages back to the model. VerMCTS performs best on most of the problems, with the clearest advantage on the more complex verification tasks.
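As a rough illustration of pass@T, the sketch below computes an empirical pass rate from a set of independent runs. It assumes each run records the number of tokens sampled before its first verified solution, or `None` if it never succeeded; the function name `pass_at_t` and the sample numbers are hypothetical, and the paper's exact estimator may differ.

```python
from typing import Optional, Sequence


def pass_at_t(tokens_to_success: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of runs that reached a verified solution within `budget` tokens.

    Each entry is the token count at a run's first success, or None if the
    run never produced a verified program.
    """
    if not tokens_to_success:
        raise ValueError("need at least one run")
    solved = sum(1 for t in tokens_to_success if t is not None and t <= budget)
    return solved / len(tokens_to_success)


# Made-up run data for illustration only.
runs = [1200, None, 4800, 7300, 300]
rate = pass_at_t(runs, budget=5000)  # 3 of the 5 runs succeed within 5000 tokens
```

Measuring cost in sampled tokens rather than in number of samples lets methods that generate different amounts of text per attempt be compared under the same budget.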


Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo

Make Every Move Count: LLM-based High-Quality RTL Code Generation Using MCTS

VeCoGen: Automating Generation of Formally Verified C Code with Large Language Models

Towards Automated Verification of LLM-Synthesized C Programs

Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification

LEVER: Learning to Verify Language-to-Code Generation with Execution

Scalable Verification Framework for C Program

AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast

Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset

CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization

Evaluating Large Language Models for Automatic Register Transfer Logic Generation via High-Level Synthesis

Towards AI-Assisted Synthesis of Verified Dafny Methods

VeriGen: A Large Language Model for Verilog Code Generation

VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency

LLM4VV: Developing LLM-driven testsuite for compiler validation

Benchmarking Large Language Models for Automated Verilog RTL Code Generation

LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency