Abstract: Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proofs from the math contest forum and 21 new IMO questions. We open-source our code at <a class="link-external link-https" href="https://github.com/InternLM/InternLM-Math" rel="external noopener nofollow">this https URL</a> and our data at <a class="link-external link-https" href="https://huggingface.co/datasets/InternLM/Lean-Workbook" rel="external noopener nofollow">this https URL</a>.
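
To make "formal-informal question pair" concrete, here is a small illustrative pair. It is not taken from the Lean Workbook dataset; the problem, the theorem name `illustrative_sq_sum_ge`, and the proof are assumptions chosen only to show the shape of such a pair, and the snippet assumes a Lean 4 project with Mathlib available.

```lean
import Mathlib

-- Informal statement (hypothetical example, not from the dataset):
--   "Prove that for all real numbers a and b, a^2 + b^2 >= 2ab."

-- A possible Lean 4 formalization, with a short proof that supplies
-- the nonnegativity of (a - b)^2 as a hint to the arithmetic tactic.
theorem illustrative_sq_sum_ge (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  nlinarith [sq_nonneg (a - b)]
```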

What problem does this paper attempt to address?

The paper addresses the weakness of large language models (LLMs) at mathematical theorem proving in formal languages such as Lean. The main obstacle is the scarcity of training data in these languages. To overcome it, the paper proposes a pipeline that iteratively generates and filters synthetic data, translating natural language mathematical problems into Lean 4 statements and vice versa. According to the paper, the resulting data improves the performance of LLMs in translating and understanding complex mathematical problems and proofs.

The paper introduces a large-scale collection of Lean problems called the "Lean Workbook", consisting of approximately 57,000 formal-informal problem pairs, along with searched proofs, from a math contest forum. The researchers use active learning to train the auto-formalization model and have open-sourced the code and dataset.

After an initial round of data collection, the model translates natural language questions into formal statements, which are then validated through back-translation, Lean compiler checks, and natural language inference (NLI). Invalid formalizations are flagged and corrected by human experts before being added to the training set. Over multiple iterations the model's translation accuracy improves, and the pipeline also yields formal statements for 21 new International Mathematical Olympiad (IMO) problems.

The main contributions of the paper are as follows:

1. Proposing an active learning-based auto-formalization pipeline.
2. Open-sourcing the translation model and the auto-formalization pipeline, which covers various mathematical topics.
3. Providing a dataset of 57,000 formalized mathematical problems, 5,000 of which have formal solutions, for auto-formalization and automated theorem proving.
4. Formalizing 21 new IMO problems that were not present in previous datasets.

The paper also discusses related work on auto-formalization and automated theorem proving, and describes each stage of the data construction pipeline in detail, including data diagnosis, iteration, and filtering. Finally, it evaluates the model and reports accuracy on several benchmarks. Although overall accuracy is high, some error patterns remain to be addressed.
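
The filtering stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Pair`, `check_candidate`, and `run_round` are hypothetical, the four callables stand in for the translation model, the Lean compiler check, the back-translation model, and the NLI check, and the order of the checks (compiler first, as the cheapest) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Pair:
    informal: str  # natural language problem
    formal: str    # candidate Lean 4 statement


def check_candidate(
    informal: str,
    translate: Callable[[str], str],        # hypothetical: NL -> Lean 4 statement
    compiles: Callable[[str], bool],        # hypothetical: Lean compiler check
    back_translate: Callable[[str], str],   # hypothetical: Lean 4 -> NL
    equivalent: Callable[[str, str], bool], # hypothetical: NLI-style comparison
) -> Optional[Pair]:
    """Return a Pair if the candidate passes every filter, else None."""
    formal = translate(informal)
    if not compiles(formal):
        return None
    recovered = back_translate(formal)
    if not equivalent(informal, recovered):
        return None
    return Pair(informal, formal)


def run_round(
    problems: List[str],
    translate: Callable[[str], str],
    compiles: Callable[[str], bool],
    back_translate: Callable[[str], str],
    equivalent: Callable[[str, str], bool],
) -> Tuple[List[Pair], List[str]]:
    """Split problems into accepted pairs and problems left for human review."""
    accepted: List[Pair] = []
    needs_review: List[str] = []
    for problem in problems:
        pair = check_candidate(
            problem, translate, compiles, back_translate, equivalent
        )
        if pair is None:
            needs_review.append(problem)
        else:
            accepted.append(pair)
    return accepted, needs_review
```

In an iterative setup, the accepted pairs and the human-corrected versions of the rejected ones would be added to the training set before the next round. Passing the checks in as callables keeps the sketch independent of any particular model or Lean toolchain.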

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover

InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems

Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers

LeanAgent: Lifelong Learning for Formal Theorem Proving

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Herald: A Natural Language Annotated Lean 4 Dataset

Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks