Towards AI-Assisted Synthesis of Verified Dafny Methods

Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, James Noble
DOI: https://doi.org/10.1145/3643763
2024-06-11
Abstract: Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases, and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval augmentation generated example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval augmentation generated CoT prompt. GPT-4 was able to generate verified, human-evaluated, Dafny methods for 58% of the problems, however, GPT-4 managed only 19% of the problems with the Contextless prompt, and even fewer (10%) for the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems, 50 that we wrote manually, and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code generating large language models...
Software Engineering, Programming Languages
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores how well large language models (LLMs) can generate formally verified code in Dafny, a verification-aware language. It focuses on three questions, one per prompt type:

1. **Contextless Prompting**: Can LLMs generate formally verified Dafny methods from a natural language description alone?
2. **Signature Prompting**: How does the quality of the generated code change when the prompt also supplies a method signature and test cases?
3. **Dynamic Few-Shot Prompting**: Does a Chain of Thought (CoT) prompt, which decomposes the problem into steps and includes retrieved, semantically similar example problems with their solutions, significantly improve the models' ability to generate correct verified code?

### Overview of Research Design

To evaluate LLMs on this task, the researchers built MBPP-san-DFY, a subset of the MBPP dataset (which mainly consists of basic Python programming tasks). It contains 178 problems, each with a description, a method signature, and test cases translated into Dafny. The researchers also manually wrote 50 formally verified Dafny methods (MBPP-DFY-50) to serve as examples in the dynamic few-shot prompting experiments.

Comparing two LLMs, GPT-4 and PaLM-2, the study found that GPT-4 performed better at generating verified Dafny code and that both models did best with the CoT prompt. With that prompt, GPT-4 produced verified, human-evaluated Dafny methods for 58% of the problems, compared with 19% for the Contextless prompt and 10% for the Signature prompt.

Overall, the paper shows that LLMs can assist in generating formally verified code and that the prompting strategy strongly affects the success rate, which gives a useful reference point for further research and applications.
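
### Illustrative Sketch of the Three Prompt Types

The three prompt types described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `Problem` class, and every function name are hypothetical, and a simple bag-of-words cosine similarity stands in for whatever semantic similarity measure the authors used to retrieve example problems.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    """Hypothetical container for one MBPP-style task."""
    description: str      # natural language task statement
    signature: str = ""   # Dafny method signature, if available
    tests: tuple = ()     # Dafny test cases, as strings
    solution: str = ""    # verified Dafny method (only set for examples)


def contextless_prompt(p: Problem) -> str:
    # Assumed wording: only the task description is given.
    return f"Write a verified Dafny method for this task:\n{p.description}\n"


def signature_prompt(p: Problem) -> str:
    # Adds the method signature and test cases to the description.
    tests = "\n".join(p.tests)
    return (
        contextless_prompt(p)
        + f"Use this method signature:\n{p.signature}\n"
        + f"The method must pass these tests:\n{tests}\n"
    )


def _bag(text: str) -> Counter:
    # Lowercased word counts; a crude stand-in for a semantic embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    # Cosine similarity between the two bags of words.
    x, y = _bag(a), _bag(b)
    dot = sum(x[w] * y[w] for w in x)
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


def retrieve(p: Problem, pool: list, k: int = 2) -> list:
    # Pick the k solved examples whose descriptions are closest to p.
    ranked = sorted(
        pool, key=lambda e: similarity(p.description, e.description), reverse=True
    )
    return ranked[:k]


def cot_prompt(p: Problem, pool: list, k: int = 2) -> str:
    # Few-shot examples first, then the task with a step-by-step instruction.
    shots = "".join(
        f"Example task:\n{e.description}\nVerified solution:\n{e.solution}\n\n"
        for e in retrieve(p, pool, k)
    )
    return (
        shots
        + f"Task:\n{p.description}\n"
        + "First list the steps needed to solve the task, "
        + "then write the Dafny method with its specification.\n"
    )
```

In this sketch the pool passed to `cot_prompt` plays the role of the 50 manually written examples (MBPP-DFY-50), and the example set changes with each problem, which is what makes the few-shot prompt "dynamic".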