MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline

Minpeng Liao, Wei Luo, Chengxi Li, Jing Wu, Kai Fan
2024-02-22
Abstract: Large language models (LLMs) have seen considerable advancements in natural language understanding tasks, yet there remains a gap to bridge before attaining true artificial general intelligence, especially concerning shortcomings in mathematical reasoning capabilities. We postulate that the inherent nature of LLM training, which focuses on predicting the probability of the next token, presents challenges in effectively modeling mathematical reasoning that demands exact calculations, both from data-driven and theoretical standpoints. In this paper, we address this challenge by enriching the data landscape and introducing a novel math dataset, enhanced with a capability to utilize a Python code interpreter. This dataset is derived from GSM8K and MATH and has been further refined through a combination of GPT-4 annotations, human review, and self-training processes, where the errors in the original GSM8K training set have been fixed. Additionally, we propose a tentative, easily replicable protocol for the fine-tuning of math-specific LLMs, which has led to a significant improvement in the performance of a 7B-parameter LLM on the GSM8K and MATH datasets. We are committed to advancing the field of mathematical reasoning in LLMs and, to that end, we have made the source code for data generation / training / inference, and the model checkpoints, publicly available at \url{
Computation and Language
What problem does this paper attempt to address?
This paper addresses the weakness of large language models (LLMs) in mathematical reasoning, especially on tasks that require exact calculation. Although LLMs have made significant progress in natural language understanding, a large gap remains in mathematical reasoning. The paper attributes part of this gap to the nature of LLM training, which focuses on predicting the probability of the next token, whereas mathematical reasoning demands exact computation. The authors argue that this mismatch poses challenges from both data-driven and theoretical perspectives. To meet this challenge, they propose the following measures:

1. **Dataset Augmentation**: A new math dataset is introduced in which solutions combine natural-language reasoning with calls to a Python code interpreter that performs the exact calculations. The dataset is derived from the existing GSM8K and MATH datasets and is further refined through GPT-4 annotation, human review, and a self-training process. Errors in the original GSM8K training set are corrected along the way.
2. **Fine-Tuning Protocol**: A reproducible fine-tuning protocol is proposed for training math-specific LLMs. With this protocol, the authors improve the performance of a 7B-parameter model on the GSM8K and MATH datasets.
3. **Multi-task Fine-Tuning**: In addition to standard supervised fine-tuning, the authors introduce multi-task fine-tuning, so that the same model can both generate solutions and evaluate their quality. This is achieved by adding a lightweight binary classifier that predicts whether the final answer is correct.

Overall, the paper aims to improve LLM performance on mathematical reasoning tasks, especially those requiring exact calculation, through dataset augmentation and new fine-tuning methods.
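As a rough illustration of the code-interpreter idea in item 1, the sketch below runs the Python snippets embedded in a draft solution and appends what each one prints, so the exact arithmetic comes from the interpreter rather than from next-token prediction. The `[code]` and `[output]` delimiters and the helper names `run_snippet` and `attach_outputs` are hypothetical, chosen only for this example; the paper's actual data format and tooling may differ.

```python
import contextlib
import io
import re

# Hypothetical delimiters for illustration; not the paper's actual format.
CODE_BLOCK = re.compile(r"\[code\](.*?)\[/code\]", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a Python snippet and return whatever it printed.

    If the snippet raises, return the error as text instead, so the
    failure is visible in the annotated solution.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # fresh namespace; this is NOT a security sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def attach_outputs(solution: str) -> str:
    """Append an [output] section after every [code] section."""

    def replace(match: re.Match) -> str:
        result = run_snippet(match.group(1).strip())
        return f"{match.group(0)}\n[output]{result}[/output]"

    return CODE_BLOCK.sub(replace, solution)


draft = (
    "Each of the 12 boxes holds 37 pens, so we multiply.\n"
    "[code]print(12 * 37)[/code]"
)
annotated = attach_outputs(draft)
print(annotated)
```

Returning the error text for a failing snippet, rather than raising, keeps the failure in the annotated solution where a reviewer or a later filtering step can see it. Running untrusted model output through `exec` is unsafe outside a toy setting; a real pipeline would need an isolated execution environment.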