MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Meng Fang, Xiangpeng Wan, Fei Lu, Fei Xing, Kai Zou

2024-06-26

Abstract: Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper examines how well large language models (LLMs) solve mathematical problems. Although LLMs have made significant progress in natural language understanding and on routine mathematical problems, they still struggle with problems that require complex reasoning. To study this, the researchers developed a new dataset called "MathOdyssey," which includes diverse mathematical problems at high school and university levels. It aims to test LLMs in advanced problem-solving scenarios and to cover a wider range of subject areas. The problems were created by experts so that the dataset can be used to evaluate, and help improve, AI performance on complex mathematical problems. The paper benchmarks both open-source and closed-source LLMs and finds that, while these models perform well on routine and moderately difficult tasks, they struggle with Olympiad-level problems and complex university-level questions. The study also finds that the performance gap between open-source and closed-source models is narrowing, although substantial challenges remain, particularly on the most demanding problems. By comparing different models, the paper emphasizes the value of the MathOdyssey dataset for evaluating the mathematical reasoning of LLMs and highlights the need for further research to improve it. The authors also discuss the limitations of existing datasets, such as the fact that they have already been exploited in model training and the scarcity of high-quality original problems. In conclusion, the paper aims to drive progress in AI's ability to solve more complex mathematical problems through the MathOdyssey dataset.
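The comparison across difficulty levels described above comes down to computing accuracy separately for each group of problems. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the function name `accuracy_by_level`, the level labels, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Return {level: fraction correct} from (level, is_correct) pairs."""
    totals = defaultdict(int)   # number of problems seen per level
    correct = defaultdict(int)  # number of correct answers per level
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in totals}


# Made-up records for illustration only; these are NOT results from the paper.
toy_records = [
    ("high_school", True),
    ("high_school", True),
    ("university", True),
    ("university", False),
    ("olympiad", False),
    ("olympiad", False),
    ("olympiad", True),
    ("olympiad", False),
]

toy_scores = accuracy_by_level(toy_records)
for level, score in toy_scores.items():
    print(f"{level}: {score:.2f}")
```

Reporting one number per level, rather than a single overall accuracy, is what makes a gap between routine and Olympiad-level problems visible.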

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks

Benchmarking Large Language Models for Math Reasoning Tasks

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads

Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models

Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?

An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

Mamo: a Mathematical Modeling Benchmark with Solvers

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Augmenting Math Word Problems via Iterative Question Composing