Proof Automation with Large Language Models

Minghai Lu, Benjamin Delaware, Tianyi Zhang
2024-09-22
Abstract: Interactive theorem provers such as Coq are powerful tools to formally guarantee the correctness of software. However, using these tools requires significant manual effort and expertise. While Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language, they are less effective at generating formal proofs in interactive theorem provers. In this paper, we conduct a formative study to identify common mistakes made by LLMs when asked to generate formal proofs. By analyzing 520 proof generation errors made by GPT-3.5, we found that GPT-3.5 often identified the correct high-level structure of a proof, but struggled to get the lower-level details correct. Based on this insight, we propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems. We evaluate PALM on a large dataset that includes more than 10K theorems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems. Moreover, PALM proves 1270 theorems beyond the reach of existing approaches. We also demonstrate the generalizability of PALM across different LLMs.
Software Engineering, Artificial Intelligence, Machine Learning, Logic in Computer Science, Programming Languages
What problem does this paper attempt to address?
This paper addresses the difficulty that Large Language Models (LLMs) have in generating formal proofs. Although LLMs perform well at generating informal proofs in natural language, they are less effective at generating formal proofs in interactive theorem provers such as Coq. By analyzing 520 proof generation errors made by GPT-3.5, the authors found that the model often identifies the correct high-level structure of a proof but struggles with the low-level details. Based on this observation, the paper proposes PALM (Proof Automation with Large Language Models), a "generate-then-repair" approach that first prompts an LLM to generate an initial proof and then applies targeted symbolic methods to iteratively repair low-level problems. On a dataset of more than 10,000 theorems, PALM proves 76.6% to 180.4% more theorems than other state-of-the-art approaches, and it proves 1,270 theorems that existing approaches could not. The authors also demonstrate that PALM generalizes across different LLMs.
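The generate-then-repair idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`generate_then_repair`, `toy_generate`, `toy_check`, `toy_repair`), the representation of a proof as a list of tactic strings, and the lookup-table repair rule are all hypothetical stand-ins. In a real system the generator would be an LLM call, the checker would be the theorem prover, and the repair step would be the targeted symbolic methods the paper describes.

```python
from typing import Callable, List, Optional

Proof = List[str]  # assumption: a proof is modelled as a list of tactic strings


def generate_then_repair(
    generate: Callable[[], Proof],
    check: Callable[[Proof], Optional[int]],
    repair: Callable[[Proof, int], Optional[Proof]],
    max_rounds: int = 10,
) -> Optional[Proof]:
    """Return a proof accepted by `check`, or None if repair gives up."""
    proof = generate()  # step 1: draft an initial proof
    for _ in range(max_rounds):
        failing = check(proof)  # index of the first failing tactic, or None
        if failing is None:
            return proof  # the checker accepts the proof
        patched = repair(proof, failing)  # step 2: fix one low-level problem
        if patched is None:
            return None  # no repair rule applies
        proof = patched
    return None  # round budget exhausted


# Toy stand-ins (hypothetical): a fixed draft with two misspelled tactics,
# a checker that only knows a small tactic vocabulary, and a table of fixes.
VALID_TACTICS = {"intros", "induction n", "simpl", "reflexivity", "auto"}
KNOWN_FIXES = {"simp": "simpl", "reflexivty": "reflexivity"}


def toy_generate() -> Proof:
    return ["intros", "induction n", "simp", "reflexivty"]


def toy_check(proof: Proof) -> Optional[int]:
    for i, tactic in enumerate(proof):
        if tactic not in VALID_TACTICS:
            return i
    return None


def toy_repair(proof: Proof, index: int) -> Optional[Proof]:
    fix = KNOWN_FIXES.get(proof[index])
    if fix is None:
        return None
    return proof[:index] + [fix] + proof[index + 1:]


result = generate_then_repair(toy_generate, toy_check, toy_repair)
print(result)
```

The loop keeps the high-level structure of the draft intact and only patches the step the checker rejects, which mirrors the observation that the draft's overall structure is often right while its details are not.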