Abstract: Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases; and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval augmentation generated example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval augmentation generated CoT prompt. GPT-4 was able to generate verified, human-evaluated Dafny methods for 58% of the problems; however, GPT-4 managed only 19% of the problems with the Contextless prompt, and even fewer (10%) with the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems: 50 that we wrote manually and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code-generating large language models...

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores how well large language models (LLMs) can generate formally verified code in Dafny, a verification-aware language. The research focuses on three questions:

1. **Contextless Prompting**: Can LLMs generate formally verified Dafny methods from a plain natural language description alone?
2. **Signature Prompting**: How does the quality of the generated verified code change when the prompt also provides a method signature and test cases?
3. **Dynamic Few-Shot Prompting**: Does the rate of correct, verified code improve significantly when the prompt breaks the problem down step by step (Chain of Thought, CoT) and includes semantically similar example problems with their solutions?

### Overview of Research Design

To evaluate LLMs on verified code generation, the researchers built MBPP-san-DFY, a subset of the MBPP dataset (which primarily targets basic Python programming tasks). It contains 178 problem descriptions, method signatures, and test cases translated into Dafny. The researchers also manually wrote 50 formally verified Dafny methods (MBPP-DFY-50) to serve as examples in the dynamic few-shot prompting experiments.

Comparing GPT-4 and PaLM-2, the study found that GPT-4 performed better at generating formally verified Dafny code. With the CoT prompt, GPT-4 produced verified Dafny methods for 58% of the problems, compared with 19% for the Contextless prompt and 10% for the Signature prompt.

Overall, the paper shows that the prompting strategy strongly affects how often LLMs produce verified Dafny code, and it contributes 153 verified Dafny solutions to MBPP problems (50 written manually and 103 synthesized by GPT-4) for further research.
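The three prompting strategies described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `Problem` class, and every function name are hypothetical, and the token-overlap (Jaccard) similarity is a simple stand-in for whatever retrieval method the authors used to select semantically similar examples.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical container for one MBPP-style task."""
    description: str
    signature: str = ""   # Dafny method signature, if available
    tests: tuple = ()     # test cases as plain strings
    solution: str = ""    # verified Dafny solution (worked examples only)


def contextless_prompt(p):
    # Only the natural language description is given to the model.
    return f"Write a verified Dafny method for this task.\n\nTask: {p.description}\n"


def signature_prompt(p):
    # Adds the method signature and test cases as extra context.
    tests = "\n".join(p.tests)
    return (
        "Write a verified Dafny method for this task.\n\n"
        f"Task: {p.description}\n"
        f"Method signature: {p.signature}\n"
        f"Test cases:\n{tests}\n"
    )


def similarity(a, b):
    # Assumption: Jaccard overlap of lowercase tokens stands in for
    # a real semantic similarity measure.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_examples(p, pool, k=3):
    # Pick the k worked examples whose descriptions best match the task.
    ranked = sorted(
        pool,
        key=lambda ex: similarity(p.description, ex.description),
        reverse=True,
    )
    return ranked[:k]


def few_shot_cot_prompt(p, pool, k=3):
    # Retrieved examples come first, then a step-by-step instruction
    # for the new task.
    parts = [
        f"Task: {ex.description}\nSolution:\n{ex.solution}\n"
        for ex in retrieve_examples(p, pool, k)
    ]
    parts.append(
        f"Task: {p.description}\n"
        "First break the task into steps, then write the specification "
        "(requires/ensures clauses), then the Dafny method.\n"
    )
    return "\n".join(parts)
```

In this sketch the example pool would play the role of the 50 manually written solutions (MBPP-DFY-50), and the output of each function would be sent to the model, with the response then checked by the Dafny verifier.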

Towards AI-Assisted Synthesis of Verified Dafny Methods
