Abstract: Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
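
The log-linear relation mentioned in the abstract can be written schematically as below. This is only an illustrative form: the symbols $N$, $\alpha$, and $\beta$ are notation assumed here, not taken from the paper, and no fitted values are implied.

```latex
\mathrm{Acc}(N) \approx \alpha + \beta \log N
```

Here $N$ stands for the number of supervised training examples. Under this reading, the observation that better models improve less with enlarged supervised datasets would correspond to a smaller slope $\beta$ for models with lower pre-training loss.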

What problem does this paper attempt to address?

The paper studies how Large Language Models (LLMs) perform on mathematical reasoning tasks, and how pre-training loss, the amount of supervised data, and the amount of augmented data affect that performance.

### Research Questions

The paper addresses three core questions:

1. **Relationship between Pre-training Loss and Model Performance**: How does pre-training loss affect model performance under Supervised Fine-Tuning (SFT) and In-Context Learning (ICL)?
2. **Impact of Supervised Data Volume**: How does the amount of supervised data change the model's mathematical reasoning performance?
3. **Role of Augmented Data**: How can Rejection sampling Fine-Tuning (RFT) generate additional correct reasoning paths, and how much do these paths improve the model's mathematical reasoning ability?

### Main Findings

- **Pre-training Loss and Performance**: As pre-training loss decreases (that is, as the pre-trained model gets better), SFT and ICL performance improve roughly linearly. The rate of improvement slows as model quality increases.
- **Supervised Data Volume and Performance**: Model performance follows a log-linear relationship with the amount of supervised data. More data improves performance, but the gains are smaller for better pre-trained models.
- **Effectiveness of Augmented Data**: Data obtained through RFT improves the model's mathematical reasoning. The key factor is the number of distinct reasoning paths in the augmented set. Combining augmented samples from multiple models improves performance further.

### Methodology

- **Supervised Fine-Tuning (SFT)**: The model is fine-tuned on a dataset of labeled mathematical problems.
- **Rejection sampling Fine-Tuning (RFT)**: Supervised models generate candidate reasoning paths; the correct ones are collected and used as additional fine-tuning data.
- **Pre-training Loss Analysis**: The pre-training loss of different pre-trained models is compared with their performance on mathematical reasoning tasks.

### Experimental Results

- Models with lower pre-training loss perform better on mathematical reasoning tasks. SFT performance rises roughly linearly as pre-training loss falls, with the rate of increase gradually slowing.
- Increasing the amount of supervised data improves performance, especially for smaller models. For larger models the gains are more limited.
- RFT improves performance most for weaker models. Combining rejection samples from multiple models outperforms samples from a single model: it raises LLaMA-7B to 49.3% accuracy on GSM8K, compared with 35.9% for SFT alone.

In summary, the paper examines the factors that affect the mathematical reasoning capabilities of large language models and proposes RFT, a simple rejection-sampling method, to improve them.
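
The RFT data-collection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `sample_fn` is a hypothetical stand-in for sampling from a fine-tuned model, the final answer is assumed to follow a `####` marker on the last line (the GSM8K convention), and "distinct" is approximated here by whitespace-normalized text.

```python
import re
from typing import Callable, List, Optional


def extract_answer(path: str) -> Optional[str]:
    """Return the text after a '####' marker on the last line, or None if absent."""
    match = re.search(r"####\s*(.+?)\s*$", path.strip())
    # Drop thousands separators so "1,000" and "1000" compare equal.
    return match.group(1).replace(",", "") if match else None


def normalize(path: str) -> str:
    """Collapse whitespace so trivially different copies count as one path."""
    return " ".join(path.split())


def rejection_sample(
    question: str,
    gold_answer: str,
    sample_fn: Callable[[str], str],  # hypothetical: one sampled path per call
    k: int,
) -> List[str]:
    """Draw k candidate paths; keep the distinct ones whose answer is correct."""
    kept: List[str] = []
    seen = set()
    for _ in range(k):
        path = sample_fn(question)
        if extract_answer(path) != gold_answer:
            continue  # reject: wrong or missing final answer
        key = normalize(path)
        if key in seen:
            continue  # reject: duplicate of a path already kept
        seen.add(key)
        kept.append(path)
    return kept


if __name__ == "__main__":
    # Canned outputs standing in for model samples (illustrative only).
    canned = iter([
        "3 + 4 = 7\n#### 7",
        "3 + 4 =  7\n#### 7",   # same path up to whitespace
        "3 * 4 = 12\n#### 12",  # wrong answer
        "4 + 3 = 7\n#### 7",
    ])
    print(rejection_sample("What is 3 + 4?", "7", lambda q: next(canned), k=4))
```

Combining samples from multiple models, as the summary describes, would amount to running this per model and merging the kept paths with the same deduplication before fine-tuning.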

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

Evaluating Mathematical Reasoning Beyond Accuracy

Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems

Benchmarking Large Language Models for Math Reasoning Tasks

ReFT: Reasoning with Reinforced Fine-Tuning

Targeted training for numerical reasoning with large language models

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification

MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning

Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation

Democratizing Reasoning Ability: Tailored Learning from Large Language Model

MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective

Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data