MIND: Math Informed syNthetic Dialogues for Pretraining LLMs

Syeda Nahida Akter, Shrimai Prabhumoye, John Kamalu, Sanjeev Satheesh, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
2024-10-16
Abstract: The utility of synthetic data to enhance pretraining data quality and hence to improve downstream task accuracy has been widely explored in recent large language models (LLMs). Yet, these approaches fall inadequate in complex, multi-hop and mathematical reasoning tasks as the synthetic data typically fails to add complementary knowledge to the existing raw corpus. In this work, we propose a novel large-scale and diverse Math Informed syNthetic Dialogue (MIND) generation method that improves the mathematical reasoning ability of LLMs. Specifically, using MIND, we generate synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM. Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data. We further identify an effective way to format and integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning, emphasizing the need to restructure raw data rather than use it as-is. Compared to pretraining just on raw data, a model pretrained on MIND-OWM shows significant boost in mathematical reasoning (GSM8K: +13.42%, MATH: +2.30%), including superior performance in specialized knowledge (MMLU: +4.55%, MMLU-STEM: +4.28%) and general purpose reasoning tasks (GENERAL REASONING: +2.51%).
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper asks how to improve the performance of large language models (LLMs) on complex mathematical reasoning tasks by generating high-quality synthetic dialogue data. Existing synthetic-data approaches to pretraining fall short on multi-step and mathematical reasoning because the synthetic data usually fails to add complementary knowledge to the existing raw corpus. The authors therefore propose MIND (Math Informed syNthetic Dialogues), a method for generating structured, information-rich synthetic dialogues that improve the mathematical reasoning ability of LLMs.

### Main problems

1. **Limitations of existing methods**:
   - Existing pretraining methods perform poorly on complex, multi-step, and mathematical reasoning tasks.
   - Synthetic data usually fails to add complementary knowledge to the existing raw corpus, especially for mathematical reasoning.
2. **Objectives**:
   - Propose a large-scale, diverse method for generating math-informed synthetic dialogues (MIND) to improve the mathematical reasoning ability of LLMs.
   - Generate synthetic dialogues based on OpenWebMath (OWM) to form a new math corpus (MIND-OWM).
   - Explore how to integrate synthetic and raw data during pretraining to maximize the gain in mathematical reasoning.

### Solutions

1. **MIND method**:
   - Use a pretrained LLM to generate synthetic dialogues based on OWM that not only decompose the original context but also explore each step in depth.
   - Refine the generated dialogues with a heuristic filter to ensure high-quality pretraining data.
   - Emphasize the importance of a knowledge gap between dialogue participants for generating high-quality math data.
2. **Experimental verification**:
   - Run experiments with different dialogue styles and evaluate each style on mathematical reasoning tasks.
   - Compare models pretrained on raw data with models pretrained on synthetic data; the results show that synthetic data significantly improves mathematical reasoning.
3. **Extension and application**:
   - Apply the MIND method to larger datasets (such as OWM-14B) to verify its effectiveness at scale.
   - Explore how to restructure the raw data during pretraining, rather than using it as-is.

### Key contributions

- Propose the MIND method and generate 64 billion tokens of synthetic data that significantly improve mathematical reasoning ability.
- Show experimentally how different dialogue styles affect reasoning tasks, and emphasize the importance of the knowledge gap in generating high-quality math data.
- Demonstrate how to integrate synthetic and raw data during pretraining to maximize the improvement in reasoning ability.

In general, this paper addresses the shortcomings of existing pretraining methods on mathematical reasoning tasks by proposing the MIND method, which offers a new way to generate high-quality synthetic dialogue data and thereby significantly improves the mathematical reasoning ability of LLMs.
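The generate-then-filter loop described under "MIND method" can be illustrated with a minimal Python sketch. Nothing here is taken from the paper's code: the prompt template, the `DialogueStyle` class, the `generate` callable (a stand-in for whatever pretrained LLM is used), and the filter thresholds `min_words` and `min_turns` are all hypothetical names and values chosen for illustration. A teacher/student style is used as one way to encode a knowledge gap between participants.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical prompt template; the paper's actual prompts are not reproduced here.
PROMPT_TEMPLATE = (
    "Convert the following text into a conversation between a {speaker_a} "
    "and a {speaker_b}. Break the reasoning into steps and explain each one.\n\n"
    "Text:\n{context}\n\nConversation:"
)


@dataclass
class DialogueStyle:
    """Two participant roles; unequal roles model a knowledge gap."""
    speaker_a: str
    speaker_b: str


def build_prompt(context: str, style: DialogueStyle) -> str:
    """Wrap one raw-text chunk in a style-specific conversation prompt."""
    return PROMPT_TEMPLATE.format(
        speaker_a=style.speaker_a, speaker_b=style.speaker_b, context=context
    )


def passes_filter(dialogue: str, min_words: int = 50, min_turns: int = 2) -> bool:
    """Assumed heuristic: keep dialogues that are long enough and have
    at least `min_turns` lines that look like 'Speaker: text'."""
    turns = [line for line in dialogue.splitlines() if ":" in line]
    return len(dialogue.split()) >= min_words and len(turns) >= min_turns


def generate_corpus(
    chunks: Iterable[str],
    style: DialogueStyle,
    generate: Callable[[str], str],  # stand-in for an LLM call
) -> List[str]:
    """Generate one dialogue per raw chunk and keep those passing the filter."""
    kept = []
    for chunk in chunks:
        dialogue = generate(build_prompt(chunk, style))
        if passes_filter(dialogue):
            kept.append(dialogue)
    return kept
```

To try it, pass any function that maps a prompt string to generated text as `generate`, for example a thin wrapper around an inference client. Keeping the model call behind a plain callable makes the filter and prompt logic testable without a model.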