TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang
2024-10-04
Abstract: Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. However, due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data, most modern LLMs exhibit suboptimal performance. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address these challenges, this paper proposes TheoremLlama, an end-to-end framework that trains a general-purpose LLM to be a Lean4 expert. TheoremLlama includes an NL-FL dataset generation and bootstrapping method to obtain an aligned dataset, curriculum learning and block training techniques to train the model, and an iterative proof writing method to write Lean4 proofs; these components work together synergistically. Using the dataset generation method in TheoremLlama, we provide Open Bootstrapped Theorems (OBT), an NL-FL aligned and bootstrapped dataset. Our novel NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leverages the NL reasoning ability of LLMs for formal reasoning. The TheoremLlama framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. Our code, model checkpoints, and the generated dataset are published on GitHub.
Formal Languages and Automata Theory, Artificial Intelligence
What problem does this paper attempt to address?
The paper asks how to turn a general-purpose large language model (LLM) into a Lean4 theorem-proving expert. It focuses on three challenges:

1. **Data Scarcity Problem**: Data that aligns natural language (NL) with formal language (FL) is scarce, so most modern LLMs perform poorly on formal theorem-proving tasks. The same scarcity limits the development of methods for training LLMs and of techniques for making full use of their capabilities when constructing formal proofs.
2. **Difficulty in Direct Transfer**: Modern LLMs reason well in natural language, but formal languages such as Lean4 differ so much from natural language that this reasoning ability does not transfer directly to formal reasoning. In addition, the pre-training data of LLMs may contain Lean3 code, which is easily confused with Lean4 and makes the problem worse.
3. **Complexity of Formal Proofs**: Formal proofs involve a great deal of repetitive, tedious work that is unfamiliar to mathematicians accustomed to high-level proofs. Demand for automated formal theorem proving is therefore growing, but existing methods usually rely on searching for possible strategies in an infinite space to complete a proof, which is computationally expensive.

To address these challenges, the paper proposes the **TheoremLlama** framework, which turns a general-purpose LLM into a Lean4 expert through three components that work together:

1. **NL-FL Aligned Data Generation**: The Open Bootstrapped Theorems (OBT) dataset is built by extracting 100,000 high-quality, human-written proofs from Mathlib4 and informalizing them with Gemini-1.5 and retrieved examples. The NL-FL bootstrapping method then embeds the natural-language reasoning into the Lean4 code, which helps the LLM understand the theorems and apply its natural-language reasoning ability to formal reasoning.
2. **Lean4 Prover Training**: Block training is introduced to strengthen the LLM's in-context learning ability, and a curriculum data-ordering strategy is adopted to keep the training process smooth. Together these techniques help the LLM learn the unfamiliar task of Lean4 theorem proving.
3. **Iterative Proof Writing**: Previously generated correct proofs are reused as in-context examples, so the LLM's formal-reasoning ability improves step by step. This reduces the distribution gap between the generated data and real-world natural-language statements and proofs.

With these methods, the TheoremLlama framework achieves cumulative accuracies of 36.48% and 33.61% on the MiniF2F-Valid and MiniF2F-Test datasets respectively, well above the GPT-4 baseline (22.95% and 25.41%). These results indicate that TheoremLlama both addresses the data-scarcity problem and improves LLM performance on formal theorem-proving tasks.
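
The iterative proof writing step can be pictured as a loop that feeds verified proofs back into the prompt. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: `generate` (the LLM call), `verify` (the Lean4 check), `rounds`, and `max_examples` are hypothetical names introduced here for illustration, and how the paper actually selects in-context examples is not described in this summary.

```python
from typing import Callable, Dict, List, Tuple

# (theorem statement, verified proof) pair used as an in-context example.
Example = Tuple[str, str]


def iterative_proof_writing(
    theorems: List[str],
    generate: Callable[[str, List[Example]], str],  # hypothetical LLM call
    verify: Callable[[str, str], bool],             # hypothetical Lean4 checker
    seed_examples: List[Example],
    rounds: int = 3,        # assumed number of iterations
    max_examples: int = 4,  # assumed cap on in-context examples per prompt
) -> Dict[str, str]:
    """Return a mapping from solved theorem statements to verified proofs."""
    examples = list(seed_examples)
    solved: Dict[str, str] = {}
    for _ in range(rounds):
        newly_solved: List[Example] = []
        for statement in theorems:
            if statement in solved:
                continue  # already proved in an earlier round
            # Use the most recently verified proofs as in-context examples.
            context = examples[-max_examples:]
            proof = generate(statement, context)
            if verify(statement, proof):
                solved[statement] = proof
                newly_solved.append((statement, proof))
        if not newly_solved:
            break  # no progress this round, so further rounds cannot help
        # Proofs verified in this round become examples for the next round.
        examples.extend(newly_solved)
    return solved


def cumulative_accuracy(solved: Dict[str, str], theorems: List[str]) -> float:
    """Fraction of theorems proved in any round so far."""
    return len(solved) / len(theorems) if theorems else 0.0
```

In this sketch a theorem counts as solved once any round produces a verified proof, which matches the idea of a cumulative accuracy; only proofs that pass the checker are ever reused as examples, so incorrect generations do not enter later prompts.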