MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula

Shubhra Mishra, Gabriel Poesia, Belinda Mo, Noah D. Goodman

2024-07-01

Abstract: Mathematical problem solving is an important skill for Large Language Models (LLMs), both as an important capability and a proxy for a range of reasoning abilities. Existing benchmarks probe a diverse set of skills, but they yield aggregate accuracy metrics, obscuring specific abilities or weaknesses. Furthermore, they are difficult to extend with new problems, risking data contamination over time. To address these challenges, we propose MathCAMPS: a method to synthesize high-quality mathematical problems at scale, grounded on 44 fine-grained "standards" from the Mathematics Common Core (CC) Standard for K-8 grades. We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers. We then use LLMs to realize the symbolic problems into word problems. We propose a cycle-consistency method for validating problem faithfulness. Finally, we derive follow-up questions from symbolic structures and convert them into follow-up word problems - a novel task of mathematical dialogue that probes for robustness in understanding. Experiments on 23 LLMs show surprising failures even in the strongest models (in particular when asked simple follow-up questions). Moreover, we evaluate training checkpoints of Pythia 12B on MathCAMPS, allowing us to analyze when particular mathematical skills develop during its training. Our framework enables the community to reproduce and extend our pipeline for a fraction of the typical cost of building new high-quality datasets.

Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses two weaknesses in how the mathematical problem-solving abilities of large language models (LLMs) are evaluated. Existing benchmarks cover a diverse set of skills but report only aggregate accuracy, which hides specific strengths and weaknesses. They are also hard to extend with new problems, so they risk data contamination over time. To address these challenges, the paper introduces MathCAMPS, a method for synthesizing high-quality mathematical problems at scale, grounded in 44 fine-grained standards from the Mathematics Common Core for grades K-8. Each standard is encoded as a formal grammar from which diverse symbolic problems and their answers are sampled; LLMs then turn the symbolic problems into word problems. A cycle-consistency check validates that each word problem is faithful to its symbolic source. Follow-up questions are derived from the symbolic structures and converted into follow-up word problems, which probe how robust a model's understanding is. Experiments on 23 LLMs show surprising failures even in the strongest models, particularly on simple follow-up questions. Evaluating training checkpoints of Pythia 12B additionally shows when particular mathematical skills develop during training. The framework lets the community reproduce and extend the pipeline for a fraction of the typical cost of building new high-quality datasets. In summary, the paper aims to provide a scalable, fine-grained, and faithful benchmark for LLM mathematical reasoning, one that reports performance on individual skills rather than a single overall score.
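The pipeline summarized above (sample a symbolic problem with its answer, realize it as a word problem, check faithfulness by translating back, derive a follow-up) can be illustrated with a toy sketch. This is not the authors' implementation: the one-step add/subtract "standard", the function names, and the dictionary layout are all hypothetical, and a text template plus a regular expression stand in for the LLM calls the paper uses in both translation directions. The check shown here assumes that faithfulness is judged by comparing the answer of the back-translated problem with the original answer.

```python
import random
import re

# Hypothetical toy "standard": one-step addition or subtraction.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
VERBS = {"+": "buys", "-": "gives away"}


def sample_symbolic(rng):
    """Sample a symbolic problem and compute its answer."""
    a, b = rng.randint(1, 20), rng.randint(1, 20)
    op = rng.choice(sorted(OPS))
    if op == "-" and b > a:
        a, b = b, a  # keep subtraction results non-negative
    return {"a": a, "op": op, "b": b, "answer": OPS[op](a, b)}


def follow_up(problem, rng):
    """Derive a follow-up by changing one operand and recomputing the answer."""
    new_a = problem["a"] + rng.randint(1, 5)
    return {**problem, "a": new_a, "answer": OPS[problem["op"]](new_a, problem["b"])}


def template_to_words(problem):
    """Stand-in for the LLM that writes the word problem."""
    return (f"Ana has {problem['a']} apples. She {VERBS[problem['op']]} "
            f"{problem['b']} apples. How many apples does she have now?")


def regex_to_symbols(text):
    """Stand-in for the LLM that translates a word problem back to symbols."""
    m = re.search(r"has (\d+) apples\. She (buys|gives away) (\d+) apples", text)
    if m is None:
        return None
    op = "+" if m.group(2) == "buys" else "-"
    return {"a": int(m.group(1)), "op": op, "b": int(m.group(3))}


def always_buys_words(problem):
    """An unfaithful writer that ignores the operation (for demonstration)."""
    return (f"Ana has {problem['a']} apples. She buys "
            f"{problem['b']} apples. How many apples does she have now?")


def cycle_consistent(problem, to_words, to_symbols):
    """Accept a word problem only if its back-translation yields the same answer."""
    recovered = to_symbols(to_words(problem))
    if recovered is None:
        return False
    return OPS[recovered["op"]](recovered["a"], recovered["b"]) == problem["answer"]


if __name__ == "__main__":
    rng = random.Random(0)
    problem = sample_symbolic(rng)
    print(template_to_words(problem), "->", problem["answer"])
    print("faithful:", cycle_consistent(problem, template_to_words, regex_to_symbols))
    print("follow-up:", template_to_words(follow_up(problem, rng)))
```

Because the answer is computed from the symbolic form rather than from the generated text, a word problem that drifts from its source (as `always_buys_words` does for subtraction) is rejected by the check instead of entering the dataset with a wrong label.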

ControlMath: Controllable Data Generation Promotes Math Generalist Models

Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Measuring Mathematical Problem Solving With the MATH Dataset

MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

AI-Assisted Generation of Difficult Math Questions

Augmenting Math Word Problems via Iterative Question Composing

FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems