Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen
2024-06-08
Abstract: Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proofs from the math contest forum and 21 new IMO questions. We open-source our code at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-Workbook.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the weakness of large language models (LLMs) at mathematical theorem proving in formal languages such as Lean. The main obstacle is the scarcity of training data in these languages. To overcome it, the paper proposes a pipeline that iteratively generates and filters synthetic data, translating natural language mathematical problems into Lean 4 statements and vice versa. According to the paper, data produced this way improves how well LLMs translate and understand complex mathematical problems and proofs.

The paper introduces a large-scale collection of Lean problems called the "Lean Workbook". It consists of about 57,000 formal-informal problem pairs drawn from a math contest forum, along with searched proofs. The researchers use active learning to train the auto-formalization model and have open-sourced the code and dataset.

After an initial round of data collection, the model translates natural language problems into formal statements, which are then validated through Lean compiler checks, back-translation, and natural language inference (NLI). Invalid formalizations are flagged and corrected by human experts before being added to the training set. Translation accuracy improves over multiple iterations, and the pipeline also yields formalized statements of 21 new International Mathematical Olympiad (IMO) problems.

The main contributions of the paper are as follows:
1. An active learning-based auto-formalization pipeline.
2. An open-source translation model and pipeline for auto-formalization across various mathematical topics.
3. A dataset of about 57,000 formalized mathematical problems, including 5,000 with formal solutions, for auto-formalization and automated theorem proving.
4. Formalizations of 21 new IMO problems that were not present in previous datasets.
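To make the notion of a formal-informal pair concrete, here is a small illustrative example. It is not taken from the Lean Workbook dataset: the problem, the theorem name `sq_sum_ge_two_mul`, and the proof are assumptions chosen for illustration, and the snippet assumes Mathlib is available.

```lean
import Mathlib

-- Informal statement: for all real numbers a and b, a^2 + b^2 ≥ 2ab.
-- Formal statement: a hypothetical Lean 4 translation of the sentence above.
theorem sq_sum_ge_two_mul (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  -- Follows from (a - b)^2 ≥ 0.
  nlinarith [sq_nonneg (a - b)]
```

In a dataset of this kind, the informal sentence and the theorem statement would form one pair, and the tactic proof would correspond to a searched proof where one is available.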
The paper also discusses related work on auto-formalization and automated theorem proving, and describes each stage of the data construction pipeline in detail, including data diagnosis, iteration, and filtering. Finally, the paper evaluates the model and reports accuracy on different benchmarks. Although overall accuracy is high, some error patterns remain to be addressed.
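The filtering stage can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the `Candidate` type, the `filter_candidates` function, and the three checker callables are hypothetical names, and the toy checkers below are stand-ins for the Lean compiler, a back-translation model, and an NLI model. It also assumes that the NLI step compares the back-translated text with the original problem.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Candidate:
    informal: str  # natural language problem
    formal: str    # candidate Lean 4 statement produced by the translator


def filter_candidates(
    candidates: Iterable[Candidate],
    compiles: Callable[[str], bool],
    back_translate: Callable[[str], str],
    same_meaning: Callable[[str, str], bool],
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split candidates into (accepted, flagged for human review)."""
    accepted: List[Candidate] = []
    flagged: List[Candidate] = []
    for cand in candidates:
        # Check 1: the formal statement must compile.
        if not compiles(cand.formal):
            flagged.append(cand)
            continue
        # Check 2: translate the formal statement back to natural language.
        recovered = back_translate(cand.formal)
        # Check 3: the back-translation must agree with the original problem.
        if same_meaning(cand.informal, recovered):
            accepted.append(cand)
        else:
            flagged.append(cand)
    return accepted, flagged


# Toy stand-ins (assumptions for illustration only).
DEMO = [
    Candidate("Prove that a^2 + b^2 >= 2ab for real a, b.",
              "theorem t1 (a b : Real) : a ^ 2 + b ^ 2 >= 2 * a * b := by sorry"),
    Candidate("Prove that x + 0 = x.",
              "broken statement"),
    Candidate("Prove that n is even.",
              "theorem t3 (n : Nat) : n % 2 = 1 := by sorry"),
]

TOY_BACK_TRANSLATIONS = {
    DEMO[0].formal: "Prove that a^2 + b^2 >= 2ab for real a, b.",
    DEMO[2].formal: "Prove that n is odd.",
}


def toy_compiles(formal: str) -> bool:
    # Stand-in for a real Lean compiler call.
    return formal.strip().startswith("theorem")


def toy_back_translate(formal: str) -> str:
    # Stand-in for a model that translates Lean back to natural language.
    return TOY_BACK_TRANSLATIONS.get(formal, "")


def toy_same_meaning(original: str, recovered: str) -> bool:
    # Stand-in for an NLI model; here just a normalized string comparison.
    return original.strip().lower() == recovered.strip().lower()


if __name__ == "__main__":
    ok, bad = filter_candidates(DEMO, toy_compiles, toy_back_translate, toy_same_meaning)
    print(len(ok), "accepted;", len(bad), "flagged for review")
```

Passing the checkers as callables keeps the control flow separate from the tools behind it, so each toy function could be swapped for a real compiler or model call. In an iterative setup, flagged items would go to human experts and the corrected pairs would be added to the training set for the next round.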