ImProver: Agent-Based Automated Proof Optimization

Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck
2024-10-07
Abstract: Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.
Artificial Intelligence, Computation and Language, Machine Learning, Logic in Computer Science
What problem does this paper attempt to address?
The paper addresses the problem of automated proof optimization: rewriting a formal proof of a mathematical theorem so that it remains correct while optimizing a chosen criterion, such as proof length or readability. It introduces ImProver, an agent based on large language models (LLMs) that rewrites proofs in the Lean proof assistant to optimize arbitrary user-defined metrics. The study found that naively applying LLMs to proof optimization falls short, so the paper proposes a series of improvements:

1. **Chain-of-States Prompting**: Using Lean metaprogramming to annotate the intermediate proof state before each step, so the model can see the goals and hypotheses in play at that point of the proof.
2. **Output Format Adjustment**: Introducing different output formats, such as lists and tree structures, to generate more structured proofs.
3. **Sampling Method Improvement**: Introducing several sampling methods, such as best-of-n sampling and iterative refinement, to raise the quality of generated proofs.
4. **Retrieval Enhancement**: Using Maximum Marginal Relevance (MMR) retrieval to pull relevant material from existing databases into the prompt.

Experiments on real-world undergraduate theorems, competition problems, and research-level mathematics showed that ImProver significantly outperformed the baseline GPT-4 model in optimizing proof length and readability. The paper also reports ablation tests that check the contribution of each component and identify the best-performing parameter combination. Overall, ImProver effectively optimizes mathematical proofs across a range of difficulty levels.
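
To make the Maximum Marginal Relevance retrieval mentioned above concrete, the following is a minimal Python sketch of the generic MMR selection rule, not ImProver's actual implementation. It greedily picks the candidate that maximizes `lam * relevance - (1 - lam) * redundancy`, where relevance is similarity to the query and redundancy is the highest similarity to anything already picked. The function names (`cosine`, `mmr_select`), the use of cosine similarity, the toy two-dimensional vectors, and the value of `lam` are all assumptions made for illustration; the summary does not specify the embedding model or parameters the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_select(query, candidates, k, lam=0.5):
    """Return indices of up to k candidates chosen greedily by MMR.

    lam = 1.0 ranks purely by relevance to the query; smaller values
    penalize candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is 0.0 while nothing has been selected yet.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

diverse = mmr_select(query, docs, k=2, lam=0.3)        # favors diversity
relevance_only = mmr_select(query, docs, k=2, lam=1.0)  # ignores redundancy
print(diverse, relevance_only)
```

In this toy setup, a low `lam` skips the near-duplicate document in favor of the dissimilar one, while `lam=1.0` reduces to plain relevance ranking. In a prompt-building setting, that trade-off is what keeps retrieved examples from repeating one another.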