Abstract: Large language models (LLMs) have seen considerable advancements in natural language understanding tasks, yet there remains a gap to bridge before attaining true artificial general intelligence, especially concerning shortcomings in mathematical reasoning capabilities. We postulate that the inherent nature of LLM training, which focuses on predicting the probability of the next token, presents challenges in effectively modeling mathematical reasoning that demands exact calculations, both from data-driven and theoretical standpoints. In this paper, we address this challenge by enriching the data landscape and introducing a novel math dataset, enhanced with the capability to utilize a Python code interpreter. This dataset is derived from GSM8K and MATH and has been further refined through a combination of GPT-4 annotations, human review, and self-training processes, in which the errors in the original GSM8K training set have been fixed. Additionally, we propose a tentative, easily replicable protocol for the fine-tuning of math-specific LLMs, which has led to a significant improvement in the performance of a 7B-parameter LLM on the GSM8K and MATH datasets. We are committed to advancing the field of mathematical reasoning in LLMs and, to that end, we have made the source code for data generation, training, and inference, along with the model checkpoints, publicly available at \url{
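To illustrate why delegating arithmetic to a Python interpreter helps with exact calculation, the following is a minimal sketch, not the paper's pipeline. It assumes a hypothetical solution format in which reasoning text is interleaved with fenced Python snippets; the names `FENCE`, `CODE_BLOCK`, and `run_snippets` are illustrative only.

```python
import contextlib
import io
import re

# Assumption: snippets are delimited by triple-backtick fences tagged "python".
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_snippets(solution: str) -> list[str]:
    """Run each Python snippet in order and return the text each one printed."""
    namespace: dict = {}  # shared, so later snippets can reuse earlier variables
    outputs = []
    for code in CODE_BLOCK.findall(solution):
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            # exec is unsafe on untrusted text; a real system would sandbox it.
            exec(code, namespace)
        outputs.append(buffer.getvalue().strip())
    return outputs


# A hypothetical model-written solution: text for reasoning, code for arithmetic.
solution = (
    "We need the product of 1234 and 5678.\n"
    + FENCE + "python\n"
    + "product = 1234 * 5678\n"
    + "print(product)\n"
    + FENCE + "\n"
    + "The interpreter output is the final answer.\n"
)

print(run_snippets(solution))  # ['7006652']
```

The interpreter computes the product exactly, so the model only has to write the expression rather than predict each digit token by token.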

What problem does this paper attempt to address?

This paper addresses the deficiencies of large language models (LLMs) in mathematical reasoning, especially on tasks that require precise calculations. Although LLMs have made significant progress in natural language understanding, a large gap remains in mathematical reasoning. The paper attributes this gap partly to the nature of LLM training, which focuses on predicting the probability of the next token, whereas mathematical reasoning requires exact calculation. This poses challenges from both data-driven and theoretical perspectives. To meet this challenge, the authors propose the following measures:

1. **Dataset Augmentation**: A new math dataset is introduced in which solutions can call a Python code interpreter to perform precise calculations. The dataset is based on the existing GSM8K and MATH datasets and has been refined through GPT-4 annotation, human review, and a self-training process, correcting the errors in the original GSM8K training set.
2. **Fine-Tuning Protocol**: A reproducible fine-tuning protocol is proposed for training math-specific LLMs. With it, the authors significantly improve the performance of a 7B-parameter model on the GSM8K and MATH datasets.
3. **Multi-task Fine-Tuning**: In addition to traditional supervised fine-tuning, the authors introduce multi-task fine-tuning, which enables the model to generate solutions and evaluate their quality at the same time. This is achieved by adding a lightweight binary classifier that predicts whether the final answer is correct.

Overall, the paper aims to improve LLM performance on mathematical reasoning tasks, especially those requiring precise calculations, through dataset augmentation and new fine-tuning methods.
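The multi-task fine-tuning described above can be sketched as a combined objective: a generation loss over the solution tokens plus a binary classification loss on whether the final answer is correct. The summary does not state how the two terms are combined, so the additive form and the weight `alpha` below are assumptions, and all function names are hypothetical.

```python
import math


def token_nll(target_token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the reference tokens (generation task)."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def binary_cross_entropy(predicted_correct_prob: float, is_correct: bool) -> float:
    """Loss for a classifier that predicts whether the final answer is correct."""
    # Clamp to avoid log(0) on over-confident predictions.
    p = min(max(predicted_correct_prob, 1e-12), 1.0 - 1e-12)
    return -math.log(p) if is_correct else -math.log(1.0 - p)


def multitask_loss(
    target_token_probs: list[float],
    predicted_correct_prob: float,
    is_correct: bool,
    alpha: float = 1.0,  # assumed weighting between the two tasks
) -> float:
    """Generation loss plus alpha times the correctness-classification loss."""
    generation = token_nll(target_token_probs)
    evaluation = binary_cross_entropy(predicted_correct_prob, is_correct)
    return generation + alpha * evaluation


# Toy numbers: probabilities the model assigned to each reference token, and
# the classifier's estimated probability that the sampled answer is correct.
loss = multitask_loss([0.9, 0.8, 0.95], predicted_correct_prob=0.7, is_correct=True)
print(f"combined loss: {loss:.4f}")
```

In a real training setup both terms would be computed from model logits in a framework such as PyTorch; plain Python is used here only to keep the arithmetic visible.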

MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline

MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

INC-Math: Integrating Natural Language and Code for Enhanced Mathematical Reasoning in Large Language Models

Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

CoinMath: Harnessing the Power of Coding Instruction for Math LLMs