Abstract: Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage the metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH$^2$, a dataset of higher-quality math questions, as evidenced by: (a) lower performance of all models on MATH$^2$ than on MATH, and (b) higher performance on MATH when using MATH$^2$ questions as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight. Also of interest is a striking relationship observed between models' performance on the two datasets: the success rate on MATH$^2$ is the square of the success rate on MATH, suggesting that successfully solving a question in MATH$^2$ requires a nontrivial combination of two distinct math skills.

What problem does this paper attempt to address?

Mathematical reasoning is treated as a core capability in current large language model (LLM) training, but existing public data sources have been fully exploited, leaving an unmet demand for diverse and challenging math questions. Relying solely on human experts to write such questions is both time-consuming and costly, while questions generated by LLMs alone often lack the necessary diversity and difficulty. The paper therefore proposes a design framework that combines LLMs with human-in-the-loop review to generate diverse and challenging math questions. Specifically, the goals of the paper include:

1. **Utilizing the capabilities of LLMs**: A strong LLM extracts core "skills" from existing math datasets, and these skills serve as the basis for generating new questions.
2. **Generating high-quality problems**: The LLM is prompted with random pairs of different core skills to generate new questions that require applying both skills at once, which makes the generated questions "out-of-distribution" tasks for both LLMs and humans.
3. **Multi-round interactive generation and verification**: Through multi-turn prompting, the LLM iteratively generates and refines questions and their solutions, which human annotators then verify and refine further.
4. **Evaluating model performance**: Applying the pipeline to skills extracted from the MATH dataset yields a new question set, MATH$^2$, on which different models are evaluated. All models perform worse on MATH$^2$ than on MATH, indicating that the new questions are more challenging.

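The skill-pairing step in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the skill names, the prompt wording, and the helper functions `sample_skill_pairs` and `build_prompt` are all hypothetical, and the actual pipeline adds multi-turn refinement and human verification on top of this step.

```python
import itertools
import random


def sample_skill_pairs(skills, n_pairs, seed=0):
    """Sample up to n_pairs distinct unordered pairs of different skills."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Every unordered pair of two *different* skills, in a stable order.
    all_pairs = list(itertools.combinations(sorted(set(skills)), 2))
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))


def build_prompt(skill_a, skill_b):
    """Build a question-generation prompt (hypothetical wording)."""
    return (
        "Write one challenging math question that cannot be solved without "
        f"using both of these skills: '{skill_a}' and '{skill_b}'. "
        "Then give a step-by-step solution."
    )


# Hypothetical skill labels; in the paper, skills are extracted by an LLM
# from an existing dataset such as MATH.
skills = [
    "modular_arithmetic",
    "geometric_series",
    "triangle_inequality",
    "counting_principles",
]

pairs = sample_skill_pairs(skills, n_pairs=3)
prompts = [build_prompt(a, b) for a, b in pairs]
for prompt in prompts:
    print(prompt)
```

Each prompt would then be sent to an LLM, and the resulting question and solution passed on to the refinement and human-verification stages.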
The paper also reports a striking relationship: a model's success rate on MATH$^2$ is approximately equal to the square of its success rate on MATH, which suggests that solving a MATH$^2$ question requires a nontrivial combination of two distinct math skills. This finding sheds light on how well models generalize and offers a new perspective on generating math questions in the future.
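One heuristic way to read the squared relationship, offered here as an interpretation under stated assumptions rather than as the paper's derivation: suppose a model applies any single skill correctly with probability roughly equal to its MATH success rate $s$, and suppose that success on the two skills needed for a MATH$^2$ question is independent. Then:

```latex
\Pr[\text{solve a MATH}^2 \text{ question}]
  \approx \Pr[\text{apply skill } a] \cdot \Pr[\text{apply skill } b]
  \approx s \cdot s = s^2
```

For example, under these assumptions a model with $s = 0.8$ on MATH would be expected to score about $0.64$ on MATH$^2$, and a model with $s = 0.5$ about $0.25$. These numbers are illustrative only and are not results from the paper.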

AI-Assisted Generation of Difficult Math Questions

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

Math Multiple Choice Question Generation via Human-Large Language Model Collaboration

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models

DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Neuro-Symbolic Data Generation for Math Reasoning

Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering

Augmenting Math Word Problems via Iterative Question Composing

Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

Adversarial Math Word Problem Generation