Abstract: The utility of synthetic data to enhance pretraining data quality and hence to improve downstream task accuracy has been widely explored in recent large language models (LLMs). Yet, these approaches fall short on complex, multi-hop and mathematical reasoning tasks, as the synthetic data typically fails to add complementary knowledge to the existing raw corpus. In this work, we propose a novel large-scale and diverse Math Informed syNthetic Dialogue (MIND) generation method that improves the mathematical reasoning ability of LLMs. Specifically, using MIND, we generate synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM. Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data. We further identify an effective way to format and integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning, emphasizing the need to restructure raw data rather than use it as-is. Compared to pretraining just on raw data, a model pretrained on MIND-OWM shows a significant boost in mathematical reasoning (GSM8K: +13.42%, MATH: +2.30%), including superior performance in specialized knowledge (MMLU: +4.55%, MMLU-STEM: +4.28%) and general purpose reasoning tasks (GENERAL REASONING: +2.51%).

What problem does this paper attempt to address?

This paper asks how to improve the performance of large language models (LLMs) on complex mathematical reasoning tasks by generating high-quality synthetic dialogue data. Existing pretraining approaches are not effective on multi-step and mathematical reasoning tasks, because synthetic data usually fails to add complementary knowledge to the existing raw corpus. The authors therefore propose MIND (Math Informed syNthetic Dialogue), a method for generating structured, information-rich synthetic dialogues that improve the mathematical reasoning ability of LLMs.

### Main problems

1. **Limitations of existing methods**:
   - Existing pretraining methods perform poorly on complex, multi-step and mathematical reasoning tasks.
   - Synthetic data usually fails to add complementary knowledge to the existing raw corpus, especially for mathematical reasoning.
2. **Objectives**:
   - Propose a new large-scale, diverse method for generating math-informed synthetic dialogues (MIND) to improve the mathematical reasoning ability of LLMs.
   - Generate synthetic dialogues based on OpenWebMath (OWM) to form a new math corpus (MIND-OWM).
   - Explore how to integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning.

### Solutions

1. **MIND method**:
   - Use a pretrained LLM to generate synthetic dialogues based on OWM that not only decompose the original context but also explore each step in depth.
   - Refine the generated dialogues with a heuristic filter to ensure high-quality pretraining data.
   - Emphasize the importance of a knowledge gap between dialogue participants for generating high-quality math data.
2. **Experimental verification**:
   - Evaluate different dialogue styles on mathematical reasoning tasks.
   - Compare models pretrained on raw data with models pretrained on synthetic data; the results show that synthetic data significantly improves mathematical reasoning ability.
3. **Extension and application**:
   - Apply MIND to larger datasets (such as OWM-14B) to verify its effectiveness at scale.
   - Explore how to restructure the raw data during pretraining to optimize the reasoning process.

### Key contributions

- Propose the MIND method, generate 64 billion tokens of synthetic data, and significantly improve mathematical reasoning ability.
- Verify experimentally how different dialogue styles affect reasoning tasks, and emphasize the importance of the knowledge gap for generating high-quality math data.
- Demonstrate how to integrate synthetic and raw data during pretraining to maximize the gain in reasoning ability.

Overall, the paper addresses the weaknesses of existing pretraining methods on mathematical reasoning tasks by proposing MIND, a new way to generate high-quality synthetic dialogue data that significantly improves the mathematical reasoning ability of LLMs.
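
The generate-then-filter pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt wording, the `teacher`/`student` roles, the `generate` callable standing in for an LLM, and the filter thresholds (`min_words`, `min_turns`) are all assumptions made for illustration. The summary only states that a pretrained LLM rewrites OWM text as dialogue and that a heuristic filter is applied.

```python
from typing import Callable, Iterable, List

# Hypothetical prompt template: the paper's actual prompts are not reproduced here.
# It encodes a knowledge gap: one speaker explains, the other asks questions.
PROMPT_TEMPLATE = (
    "Rewrite the following math text as a conversation between a {expert} "
    "and a {novice}. The {novice} knows less and asks questions; the {expert} "
    "explains every step in detail.\n\nText:\n{context}\n\nConversation:\n"
)


def build_prompt(context: str, expert: str = "teacher", novice: str = "student") -> str:
    """Wrap one raw math document in the (assumed) dialogue-generation prompt."""
    return PROMPT_TEMPLATE.format(expert=expert, novice=novice, context=context)


def passes_filter(dialogue: str, min_words: int = 20, min_turns: int = 2) -> bool:
    """Assumed heuristic filter: keep dialogues that are long enough and multi-turn.

    A "turn" is approximated as any line containing a colon ("speaker: text").
    """
    turns = [line for line in dialogue.splitlines() if ":" in line]
    return len(dialogue.split()) >= min_words and len(turns) >= min_turns


def generate_corpus(contexts: Iterable[str], generate: Callable[[str], str]) -> List[str]:
    """Generate one dialogue per raw context and keep those that pass the filter."""
    kept = []
    for context in contexts:
        dialogue = generate(build_prompt(context))
        if passes_filter(dialogue):
            kept.append(dialogue)
    return kept


if __name__ == "__main__":
    # Stand-in for an LLM call so the sketch runs without any model.
    def fake_llm(prompt: str) -> str:
        return (
            "teacher: To solve 2x + 3 = 11 we first subtract 3 from both sides, "
            "which gives 2x = 8.\n"
            "student: And then we divide both sides by 2 to get x = 4?\n"
            "teacher: Exactly, and we can check it because 2 * 4 + 3 = 11."
        )

    corpus = generate_corpus(["Solve 2x + 3 = 11."], fake_llm)
    print(len(corpus), "dialogue(s) kept")
```

Passing the generator in as a callable keeps the sketch independent of any particular model or API. In practice `fake_llm` would be replaced by a call to whichever pretrained LLM is used for generation.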

MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Boosting Large Language Models with Socratic Method for Conversational Mathematics Teaching

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Neuro-Symbolic Data Generation for Math Reasoning

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning

Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

AI-Assisted Generation of Difficult Math Questions

Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline

Enhancing Mathematical Reasoning in LLMs with Background Operators

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?