Abstract: Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1 achieves 94.8% on the MATH dataset), indicating their inadequacy for truly challenging these models. To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. These problems are meticulously categorized into over 33 sub-domains and span more than 10 distinct difficulty levels, enabling a holistic assessment of model performance in Olympiad-mathematical reasoning. Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.

What problem does this paper attempt to address?

The paper addresses the fact that existing mathematical benchmarks (such as GSM8K and MATH) are no longer challenging enough to evaluate the mathematical reasoning of large language models (LLMs). Although these models achieve high accuracy on existing benchmarks, they still struggle with harder Olympiad-level mathematics problems. To fill this gap, the paper proposes Omni-MATH, a comprehensive and challenging benchmark designed specifically to evaluate the Olympiad-level mathematical reasoning of LLMs.

### Main Contributions

1. **Proposing the Omni-MATH Benchmark**:
   - It covers more than 33 sub-domains and a range of difficulty levels, posing new challenges for evaluating LLMs' problem-solving and complex reasoning.
   - The dataset contains 4,428 competition-level problems with rigorous human annotation, ranging from entry-level to professional international competitions.
2. **Introducing Omni-Judge**:
   - This is a mathematical verifier designed specifically for highly challenging problems. Although it has only 7B parameters, it agrees with GPT-4o on more than 90% of judgments, providing an efficient solution for verifying answers.
3. **Comprehensive Evaluation of Existing Models**:
   - The paper evaluates 15 of the strongest current LLMs. Even the most advanced models, OpenAI o1-mini and o1-preview, reach only 60.54% and 52.55% accuracy respectively on highly difficult Olympiad-level problems.
   - The mathematical disciplines are classified in detail, which yields new insights into the performance of current LLMs. For example, LLMs perform slightly better in algebra but worse in discrete mathematics.
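The consistency figure quoted for Omni-Judge can be read as a simple agreement rate: the share of problems on which two judges return the same correct/incorrect verdict. The sketch below illustrates that reading only. The function name `agreement_rate` and the verdict lists are hypothetical, and the summary does not state exactly how the paper computes consistency.

```python
from typing import Sequence


def agreement_rate(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Fraction of problems on which two judges give the same verdict."""
    if len(judge_a) != len(judge_b):
        raise ValueError("verdict lists must have the same length")
    if not judge_a:
        raise ValueError("need at least one verdict")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


# Hypothetical verdicts (True = answer judged correct) on five problems.
reference_verdicts = [True, False, True, True, False]
candidate_verdicts = [True, False, False, True, False]

rate = agreement_rate(reference_verdicts, candidate_verdicts)
print(f"agreement: {rate:.0%}")  # prints "agreement: 80%"
```

A rate above 0.9 on a held-out set would correspond to the "more than 90%" figure under this interpretation.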
### Method Overview

- **Data Collection and Annotation**:
  - Data was collected from mathematics competitions worldwide and classified into five levels by difficulty, scale, and reputation.
  - MathPix was used to convert PDF documents into LaTeX, and further data was extracted from the AoPS website.
  - A human annotation team verified the correctness and completeness of the data.
- **Difficulty Classification**:
  - Each problem was assigned a difficulty at the instance level, based on the difficulty ratings provided by the AoPS website.
  - Difficulty scores range from 0 to 10, in increments of 0.5 and 0.25.
- **Domain Classification**:
  - The mathematical domains were organized into a hierarchical tree so that model performance can be studied per domain.
  - GPT-4o was used to classify the problems.
- **Evaluation Method**:
  - GPT-4o was used to check whether model-generated answers were consistent with the reference answers.
  - An open-source evaluation model, Omni-Judge, was developed. It agrees with GPT-4o on more than 90% of judgments and provides a low-cost evaluation method.

### Experimental Results

- **Model Performance**:
  - Even the most advanced models, OpenAI o1-mini and o1-preview, reach only 60.54% and 52.55% accuracy respectively on Omni-MATH.
  - Open-source models such as Qwen2.5-MATH have surpassed GPT-4o in Olympiad-level mathematical reasoning.
- **Domain Analysis**:
  - The models are strong in domains such as algebra, calculus, and number theory, but weak in discrete mathematics.
  - A possible explanation is that algebra and calculus are more common in mathematical datasets, while discrete mathematics data is scarcer.
- **Limitations of Test-Time Scaling Techniques**:
  - The commonly used Best-of-N technique is not effective on Olympiad-level mathematics problems, so more effective test-time scaling methods need further study.

### Conclusion

The Omni-MATH benchmark shows that current LLMs still face significant challenges on highly difficult Olympiad-level mathematics problems, underscoring the need for further research on improving mathematical reasoning.
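For readers unfamiliar with the Best-of-N technique mentioned in the experimental results, a minimal sketch follows: sample N candidate answers for a problem, score each one, and keep the highest-scoring candidate. The summary does not say how candidates were scored in the paper, so the `score` callable here is an assumption standing in for something like a reward model, and `toy_generate` and `toy_score` are hypothetical stand-ins that only make the example runnable.

```python
from typing import Callable, List


def best_of_n(
    problem: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int,
) -> str:
    """Sample n candidate answers and return the one the scorer ranks highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(problem, i) for i in range(n)]
    return max(candidates, key=lambda answer: score(problem, answer))


# Toy stand-ins (hypothetical): a real setup would call an LLM and a scorer model.
def toy_generate(problem: str, sample_index: int) -> str:
    return ["41", "42", "43"][sample_index % 3]


def toy_score(problem: str, answer: str) -> float:
    return -abs(int(answer) - 42)  # pretends the scorer prefers answers near 42


chosen = best_of_n("What is 6 * 7?", toy_generate, toy_score, n=3)
print(chosen)  # prints "42"
```

In this scheme the final answer can only be as good as the scorer's ability to recognize a correct candidate among the samples, which is one general reason the technique may help less on very hard problems.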

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
