MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula

Shubhra Mishra, Gabriel Poesia, Belinda Mo, Noah D. Goodman
2024-07-01
Abstract: Mathematical problem solving is an important skill for Large Language Models (LLMs), both as an important capability and a proxy for a range of reasoning abilities. Existing benchmarks probe a diverse set of skills, but they yield aggregate accuracy metrics, obscuring specific abilities or weaknesses. Furthermore, they are difficult to extend with new problems, risking data contamination over time. To address these challenges, we propose MathCAMPS: a method to synthesize high-quality mathematical problems at scale, grounded on 44 fine-grained "standards" from the Mathematics Common Core (CC) Standard for K-8 grades. We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers. We then use LLMs to realize the symbolic problems into word problems. We propose a cycle-consistency method for validating problem faithfulness. Finally, we derive follow-up questions from symbolic structures and convert them into follow-up word problems - a novel task of mathematical dialogue that probes for robustness in understanding. Experiments on 23 LLMs show surprising failures even in the strongest models (in particular when asked simple follow-up questions). Moreover, we evaluate training checkpoints of Pythia 12B on MathCAMPS, allowing us to analyze when particular mathematical skills develop during its training. Our framework enables the community to reproduce and extend our pipeline for a fraction of the typical cost of building new high-quality datasets.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two weaknesses in how the mathematical problem-solving ability of large language models (LLMs) is evaluated. First, existing benchmarks cover diverse skills but report only aggregate accuracy, which hides a model's specific strengths and weaknesses. Second, they are hard to extend with new problems, so they risk data contamination over time.

To address both, the paper introduces MathCAMPS, a method for synthesizing high-quality mathematical problems at scale, grounded in 44 fine-grained standards from the Mathematics Common Core for grades K-8. Each standard is encoded as a formal grammar from which diverse symbolic problems and their answers are sampled, and an LLM then turns each symbolic problem into a word problem. A cycle-consistency check validates that the word problem is faithful to the symbolic problem it came from. Follow-up questions are derived from the same symbolic structures and converted into follow-up word problems, which probe how robust a model's understanding is.

Experiments on 23 LLMs show surprising failures even in the strongest models, particularly on simple follow-up questions. Evaluating training checkpoints of Pythia 12B additionally shows when particular mathematical skills develop during training. Because the pipeline is automated, the community can reproduce and extend it for a fraction of the typical cost of building a new high-quality dataset. In summary, the paper aims to provide a scalable, fine-grained, and faithful benchmark of LLM mathematical reasoning that reports performance per skill rather than as a single score.
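The generation and validation steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the one-rule addition/subtraction "grammar", the names `SymbolicProblem`, `sample_problem`, `follow_up`, and `is_cycle_consistent`, and the template-based translators that stand in for the LLM calls. Comparing answers after back-translation is one plausible reading of the cycle-consistency check.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SymbolicProblem:
    """Hypothetical symbolic form for a single add/subtract standard."""
    a: int
    op: str  # "+" or "-"
    b: int

    def answer(self) -> int:
        return self.a + self.b if self.op == "+" else self.a - self.b


def sample_problem(rng: random.Random, max_value: int = 20) -> SymbolicProblem:
    """Sample from a toy grammar; subtraction never goes below zero."""
    op = rng.choice(["+", "-"])
    a = rng.randint(1, max_value)
    b = rng.randint(0, a if op == "-" else max_value)
    return SymbolicProblem(a, op, b)


def follow_up(problem: SymbolicProblem, rng: random.Random,
              max_value: int = 20) -> SymbolicProblem:
    """Derive a follow-up by changing one operand of the symbolic problem."""
    upper = problem.a if problem.op == "-" else max_value
    new_b = rng.choice([v for v in range(upper + 1) if v != problem.b])
    return SymbolicProblem(problem.a, problem.op, new_b)


def template_to_words(p: SymbolicProblem) -> str:
    """Stand-in for the LLM that writes the word problem."""
    verb = "gets" if p.op == "+" else "gives away"
    return f"Sam has {p.a} apples and {verb} {p.b}. How many apples does Sam have now?"


def template_to_symbolic(text: str) -> Optional[SymbolicProblem]:
    """Stand-in for the LLM that translates a word problem back to symbols."""
    m = re.search(r"has (\d+) apples and (gets|gives away) (\d+)", text)
    if m is None:
        return None
    op = "+" if m.group(2) == "gets" else "-"
    return SymbolicProblem(int(m.group(1)), op, int(m.group(3)))


def is_cycle_consistent(
    problem: SymbolicProblem,
    to_words: Callable[[SymbolicProblem], str],
    to_symbolic: Callable[[str], Optional[SymbolicProblem]],
) -> bool:
    """Keep a word problem only if its back-translation has the same answer."""
    recovered = to_symbolic(to_words(problem))
    return recovered is not None and recovered.answer() == problem.answer()


if __name__ == "__main__":
    rng = random.Random(0)
    p = sample_problem(rng)
    print(template_to_words(p), "| answer:", p.answer())
    print("faithful:", is_cycle_consistent(p, template_to_words, template_to_symbolic))
    print("follow-up:", template_to_words(follow_up(p, rng)))
```

In a real pipeline the two template functions would be replaced by model calls, and problems that fail the check would be discarded rather than repaired.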