A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems

Roozbeh Yousefzadeh, Xuenan Cao
2024-11-28
Abstract: Using AI to write formal proofs for mathematical problems is a challenging task that has seen some advancements in recent years. Automated systems such as Lean can verify the correctness of proofs written in formal language, yet writing the proofs in formal language can be challenging for humans and machines. The miniF2F benchmark has 20 IMO problems in its testing set, yet formal proofs are available for only 7 of these problems (3 of which are written only by mathematicians). The model with the best accuracy can prove only 4 of these 20 IMO problems, all from the 1950s and 60s, while its training set is a secret. In this work, we write complete, original formal proofs for the remaining 13 IMO problems in Lean, along with 3 extra problems from IMO 2022 and 2023. This effort expands the availability of proofs currently in the public domain by creating 5,150 lines of Lean proof. The goal of the paper is to pave the way for developing AI models that can automatically write the formal proofs for all the IMO problems in miniF2F and beyond. In this pursuit, we devise a method to decompose the proofs of these problems into their building blocks, constructing a dataset of about 900 lemmas with 25,500 lines of Lean code. These lemmas are not trivial, yet they are approachable, providing the opportunity to evaluate and diagnose the failures and successes of AI models. We then evaluate the ability of GPT-4 to write formal proofs for these lemmas with zero-shot prompting, CoT reasoning, and lemma retrieval. In evaluating the responses, we also analyze the confounding factor of the LLM's ability to write the proofs in natural language vs. the Lean language.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of using AI to automatically write formal proofs for International Mathematical Olympiad (IMO) problems. Some progress has been made in recent years, and automated systems such as Lean can verify the correctness of a proof once it is written in a formal language, but writing such proofs remains a challenge for both humans and machines. The paper points out that miniF2F, a community benchmark for automated theorem proving, contains 20 IMO problems in its test set, yet formal proofs are available for only 7 of them (3 of which were written only by mathematicians). The best-performing model can prove only 4 of these 20 IMO problems, all from the 1950s and 1960s, and its training set is a secret. To address this gap, the authors wrote complete, original formal proofs in Lean for the remaining 13 IMO problems and added 3 further IMO problems from 2022 and 2023. This work expands the proofs available in the public domain by 5,150 lines of Lean proof code. The goal of the paper is to pave the way for AI models that can automatically write formal proofs for all the IMO problems in miniF2F and beyond. To this end, the authors designed a method to decompose the proofs of these problems into their building blocks, constructing a dataset of about 900 lemmas with a total of 25,500 lines of Lean code. These lemmas are not trivial, yet they are approachable, which makes it possible to evaluate and diagnose the successes and failures of AI models.
Subsequently, the authors evaluated the ability of GPT-4 to write formal proofs for these lemmas using zero-shot prompting, chain-of-thought reasoning under expert feedback, and lemma retrieval. In evaluating the responses, they also analyzed a confounding factor: the ability of large language models (LLMs) to write a proof in natural language versus in Lean. These evaluations divide the IMO problems into two groups. The first group consists of problems from before the 1970s. For these, LLMs have a relatively high success rate in proving the decomposed lemmas and can explain the correct proofs in natural language; when they fail, expert feedback and lemma retrieval can improve their answers. The second group consists of more recent IMO problems, whose proofs are longer and whose topics are more challenging. For these, LLMs have a lower success rate in proving the decomposed lemmas and cannot explain the correct proofs in natural language, and lemma retrieval and chain-of-thought reasoning rarely improve the accuracy of their formal proofs. In summary, the paper aims to advance AI for automatically writing formal proofs of complex mathematical problems by providing a high-quality dataset and a detailed evaluation.
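To make the idea of decomposing a proof into building-block lemmas concrete, here is a minimal Lean 4 sketch. It is a toy illustration only and is not taken from the paper's dataset: the namespace `DecompositionSketch`, both theorem names, and the statements themselves are hypothetical, and the code assumes a project with Mathlib available. An intermediate step is stated and proved as a standalone lemma, and the parent theorem then reduces to a short argument that cites it.

```lean
import Mathlib

namespace DecompositionSketch

/-- Hypothetical building-block lemma (not from the dataset):
    it follows from the fact that `(a - b) ^ 2` is nonnegative. -/
theorem two_mul_le_sq_add_sq (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]

/-- Hypothetical parent statement: once the lemma above is available,
    the remaining step is linear arithmetic. -/
theorem toy_bound (a b : ℝ) (h : a ^ 2 + b ^ 2 = 1) : 2 * a * b ≤ 1 := by
  have h₁ : 2 * a * b ≤ a ^ 2 + b ^ 2 := two_mul_le_sq_add_sq a b
  linarith

end DecompositionSketch
```

In a setup like this, a model can be asked to prove the standalone lemma on its own, so a failure can be attributed to one specific step rather than to the whole proof.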