Abstract: Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables interaction with the proof environment programmatically. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection: a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): an LLM-based prover augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 98,734 theorems and proofs extracted from Lean's math library. It features a challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research.
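The idea of retrieving only from accessible premises can be illustrated with a small sketch. It is not ReProver's retriever: a toy bag-of-tokens cosine similarity stands in for a learned encoder, and the names `embed`, `cosine`, `retrieve`, `PREMISES`, and `ACCESSIBLE` are hypothetical, introduced purely for illustration. Hard negative mining is not shown.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag of whitespace tokens (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(state: str, premises: dict[str, str], accessible: set[str], k: int = 2) -> list[str]:
    """Rank only the accessible premises by similarity to the proof state."""
    query = embed(state)
    scored = [
        (cosine(query, embed(text)), name)
        for name, text in premises.items()
        if name in accessible  # inaccessible premises are never candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical premise library: name -> statement text.
PREMISES = {
    "add_comm": "a + b = b + a",
    "mul_comm": "a * b = b * a",
    "add_zero": "a + 0 = a",
    "later_lemma": "a + b = b + a",  # assumed not accessible from the current theorem
}
ACCESSIBLE = {"add_comm", "mul_comm", "add_zero"}

top = retrieve("n + m = m + n", PREMISES, ACCESSIBLE, k=2)
```

Filtering before ranking shrinks the candidate pool, so a premise that matches the state textually but cannot be used at that point (here `later_lemma`) never competes with usable ones.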

What problem does this paper attempt to address?

This paper addresses several key obstacles facing existing theorem-proving methods based on large language models (LLMs):

1. **Private code and data**: Most existing LLM-based theorem provers rely on private code and datasets, which makes them difficult to reproduce or build upon.
2. **High computational cost**: These methods usually require large amounts of compute, such as thousands of GPU-days of training, which limits who can participate in the research.
3. **Lack of public benchmarks**: There are no public benchmark datasets for evaluating and comparing different theorem-proving methods, which hinders progress in the field.

To overcome these problems, the paper introduces LeanDojo, an open-source toolkit for Lean (a popular proof assistant) with the following main functions:

- **Data extraction**: Extract training data from Lean, including file dependencies, abstract syntax trees (ASTs), proof states and tactics, and the locations where premises are defined and used.
- **Interaction support**: Provide a programmatic interface for interacting with Lean, so that a model can observe the proof state, execute tactics, and receive feedback. This supports both training and evaluation.
- **Premise selection**: Provide fine-grained premise annotations that support a retrieval-augmented method for selecting appropriate premises during a proof, improving proof efficiency and accuracy.

Specifically, the paper makes the following contributions:

- **Openness of data and tools**: Provide open-source data extraction tools and an interactive environment, lowering the barrier to entry for research.
- **Efficient proof methods**: Develop the ReProver model, which selects premises through retrieval and needs only one GPU week of training, reducing the dependence on large-scale compute.
- **Challenging benchmark dataset**: Construct a benchmark containing 98,734 theorems and proofs, with a more challenging data split that requires the model to generalize to theorems relying on premises never used in training.

Through these contributions, LeanDojo provides new tools and benchmarks for machine learning research on theorem proving, promoting openness and further development of the field.
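The novel-premise data split mentioned above can be sketched as follows. This is a minimal greedy illustration, not the paper's actual procedure: the function `novel_premises_split`, its parameters, and the `THEOREMS` data are hypothetical. The only property it aims for is that every test theorem uses at least one premise that no training proof uses.

```python
import random


def novel_premises_split(
    theorems: dict[str, set[str]], num_test: int, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Greedy split: each test theorem uses >= 1 premise absent from all training proofs.

    `theorems` maps a theorem name to the set of premises its proof uses.
    Fewer than `num_test` theorems are returned if not enough qualify.
    """
    rng = random.Random(seed)
    candidates = sorted(theorems)
    rng.shuffle(candidates)
    train = list(candidates)
    test: list[str] = []
    for name in candidates:
        if len(test) == num_test:
            break
        rest = [t for t in train if t != name]
        # Premises still covered by training proofs if `name` is held out.
        seen = set().union(*(theorems[t] for t in rest))
        if theorems[name] - seen:  # at least one premise would be novel
            train = rest
            test.append(name)
    return train, test


# Hypothetical data: theorem name -> premises used in its proof.
THEOREMS = {
    "t1": {"add_comm", "add_zero"},
    "t2": {"add_comm"},
    "t3": {"mul_comm", "add_zero"},
    "t4": {"add_zero"},
}
train, test = novel_premises_split(THEOREMS, num_test=1)
```

Moving a theorem to the test set can only shrink the set of premises seen in training, so theorems selected earlier remain valid test cases as later ones are held out.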

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Towards Large Language Models as Copilots for Theorem Proving in Lean

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

LeanAgent: Lifelong Learning for Formal Theorem Proving

InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

LEGO-Prover: Neural Theorem Proving with Growing Libraries

LeanReasoner: Boosting Complex Logical Reasoning with Lean

LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover

Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving

Proof Automation with Large Language Models

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

Lean-STaR: Learning to Interleave Thinking and Proving

LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning

A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems

ImProver: Agent-Based Automated Proof Optimization

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

NaturalProver: Grounded Mathematical Proof Generation with Language Models

Leveraging Large Language Models for Automated Proof Synthesis in Rust

MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data