Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct MwpBench, a benchmark of Math Word Problems, which is a collection of ten datasets (including GSM8K and MATH) covering K-12, college, and competition level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy, respectively.
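The abstract reports both micro and macro average accuracy over the benchmark's datasets. The sketch below shows how the two can diverge, assuming the usual definitions: micro pools all questions across datasets, while macro is the unweighted mean of per-dataset accuracies. The function name, dataset names, and counts are made up for illustration and are not taken from the paper.

```python
def micro_macro_accuracy(results):
    """Return (micro, macro) accuracy.

    `results` maps a dataset name to (num_correct, num_total).
    Assumed definitions: micro = pooled accuracy over all questions,
    macro = unweighted mean of the per-dataset accuracies.
    """
    total_correct = sum(correct for correct, _ in results.values())
    total_questions = sum(total for _, total in results.values())
    micro = total_correct / total_questions
    macro = sum(correct / total for correct, total in results.values()) / len(results)
    return micro, macro


# Hypothetical counts: a small, easy dataset and a large, hard one.
toy_results = {"small_set": (9, 10), "large_set": (50, 100)}
micro, macro = micro_macro_accuracy(toy_results)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Under these assumptions the micro average is dominated by the larger dataset, whereas the macro average weights every dataset equally, which may be why a benchmark built from datasets of different sizes reports both.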

What problem does this paper attempt to address?

This paper aims to improve the ability of large language models (LLMs) to solve mathematical problems. Although current LLMs perform well at general problem-solving, their mathematical problem-solving remains inadequate. To address this, the paper proposes MathScale, a simple and scalable method that uses frontier LLMs (such as GPT-3.5) to create high-quality mathematical reasoning data. The method is inspired by the cognitive mechanism in human mathematical learning. It first extracts topics and knowledge points from seed math questions, builds a concept graph from them, and then generates new math questions based on concepts randomly sampled from the graph. Its advantages are a low reliance on the original training examples and the ability to generate a large number of new math questions. Using this method, the authors created the MathScaleQA dataset, which contains two million math question-answer pairs.

The authors also built MwpBench, a benchmark of math word problems that collects ten datasets (including GSM8K and MATH) covering K-12, college, and competition level problems, to evaluate mathematical reasoning more comprehensively. Experiments on MwpBench show that open-source LLMs (such as LLaMA-2 and Mistral) fine-tuned on MathScaleQA have significantly improved mathematical reasoning ability. Specifically, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy. In summary, the paper addresses how to effectively enhance the mathematical problem-solving ability of LLMs: it proposes a data generation and fine-tuning method and establishes a unified evaluation framework to promote fairer and more consistent model comparisons.
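The pipeline described above (extract concepts, build a concept graph, sample concepts, generate new questions) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the summary only says that concepts are randomly sampled from the graph, so the co-occurrence edge weights and the weighted random walk are assumptions, as are all names (`SEED_EXTRACTIONS`, `build_concept_graph`, `sample_concepts`, `make_prompt`) and the toy data. The LLM calls for extraction and generation are left out; the sketch starts from already extracted concepts and stops at the prompt text.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical result of the extraction step: one record per seed question.
SEED_EXTRACTIONS = [
    {"topics": ["fractions"], "knowledge_points": ["common denominators", "simplification"]},
    {"topics": ["fractions", "ratios"], "knowledge_points": ["proportional reasoning"]},
    {"topics": ["ratios"], "knowledge_points": ["unit rates", "proportional reasoning"]},
]


def build_concept_graph(extractions):
    """Undirected graph; edge weight = number of seed questions where two concepts co-occur."""
    graph = defaultdict(lambda: defaultdict(int))
    for record in extractions:
        concepts = sorted(set(record["topics"]) | set(record["knowledge_points"]))
        for a, b in itertools.combinations(concepts, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_concepts(graph, rng, walk_length=3):
    """Weighted random walk without revisits (an assumed sampling strategy)."""
    current = rng.choice(sorted(graph))
    visited = [current]
    for _ in range(walk_length - 1):
        candidates = [n for n in sorted(graph[current]) if n not in visited]
        if not candidates:
            break  # dead end: every neighbor was already visited
        weights = [graph[current][n] for n in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        visited.append(current)
    return visited


def make_prompt(concepts):
    """Build the text that would be sent to a question-generating LLM."""
    return "Write a new math question that involves: " + ", ".join(concepts)


graph = build_concept_graph(SEED_EXTRACTIONS)
print(make_prompt(sample_concepts(graph, random.Random(0))))
```

Because each walk can combine concepts that came from different seed questions, a sketch like this suggests how a small seed set could yield many distinct concept combinations, which fits the summary's point about low reliance on the original training examples.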

MathScale: Scaling Instruction Tuning for Mathematical Reasoning
