Abstract: Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle to solve mathematical problems because of the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset comprises diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and to cover a wide range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We benchmark open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT and Gemini series. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.

How Well Do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation

Deep Learning in Automatic Math Word Problem Solvers

Measuring Mathematical Problem Solving With the MATH Dataset

The Gap of Semantic Parsing: A Survey on Automatic Math Word Problem Solvers

MathDQN: Solving Arithmetic Word Problems Via Deep Reinforcement Learning

Enhancing Seq2seq Math Word Problem Solver with Entity Information and Math Knowledge

Training Verifiers to Solve Math Word Problems

Automatically Solving Number Word Problems by Semantic Parsing and Reasoning

MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

Explaining Math Word Problem Solvers

Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities

MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

Data-Driven Methods for Solving Algebra Word Problems