MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Boning Zhang, Chengxi Li, Kai Fan
2024-04-22
Abstract: Large language models (LLMs) have been explored in a variety of reasoning tasks, including solving mathematical problems. Each math dataset typically includes its own specially designed evaluation script, which, while suitable for its intended use, does not generalize across datasets. Consequently, updates and adaptations to these evaluation tools tend to occur without being systematically reported, leading to inconsistencies and obstacles to fair comparison across studies. To bridge this gap, we introduce a comprehensive mathematical evaluation toolkit that not only utilizes a Python computer algebra system (CAS) for its numerical accuracy, but also integrates an optional LLM, known for its considerable natural language processing capabilities. To validate the effectiveness of our toolkit, we manually annotated two distinct datasets. Our experiments demonstrate that the toolkit yields more robust evaluation results than prior work, even without an LLM. Furthermore, when an LLM is incorporated, there is a notable enhancement. The code for our method will be made available at \url{
Computation and Language
What problem does this paper attempt to address?
The paper aims to address the critical issue of the lack of standardization and consistency in automatic evaluation tools for mathematical reasoning tasks. Specifically:

1. **Problems with existing evaluation tools**: Most current mathematical datasets come with specially designed evaluation scripts that, while suitable for specific datasets, lack generality. As a result, it is difficult to fairly compare evaluation results across different studies.
2. **Proposed solution**: To fill this gap, the authors developed a comprehensive mathematical evaluation toolkit (MARIO Eval), which not only utilizes a Python computer algebra system (CAS) to ensure numerical accuracy but also optionally integrates a large language model (LLM) to enhance natural language processing capabilities. This toolkit can recognize various types of mathematical answers and evaluate the equivalence between expected and predicted answers through type-specific functions.
3. **Experimental validation**: By manually annotating two different datasets, the authors demonstrated that even without using the LLM, the toolkit provides more robust evaluation results than previous methods; and when the LLM is included, the evaluation accuracy is further improved.

In summary, the goal of this paper is to establish a convenient and standardized evaluation framework to support future research in the field of mathematical reasoning.
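To make the idea of type-specific equivalence checking concrete, here is a minimal sketch in Python. It is not MARIO Eval's actual code or API: the function names (`parse_number`, `parse_tuple`, `is_equivalent`) are hypothetical, and the standard library's exact `Fraction` arithmetic stands in for the computer algebra system, so only plain numbers and numeric tuples are recognized. The final fallback is a normalized string comparison; in the toolkit described above, that is roughly the point where a CAS or the optional LLM could be consulted instead.

```python
from fractions import Fraction
import re


def parse_number(text):
    """Return an exact Fraction for strings like '0.5', '1/2', '$1,000', else None."""
    s = text.strip().replace(",", "").replace("$", "")
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def parse_tuple(text):
    """Return a tuple of Fractions for strings like '(1, 2.5)', else None."""
    s = text.strip()
    if len(s) >= 2 and s[0] == "(" and s[-1] == ")":
        parts = [parse_number(p) for p in s[1:-1].split(",")]
        if parts and all(p is not None for p in parts):
            return tuple(parts)
    return None


def normalize(text):
    """Strip all whitespace and lowercase, for the string fallback."""
    return re.sub(r"\s+", "", text).lower()


def is_equivalent(expected, predicted):
    """Dispatch on the recognized type of the expected answer (hypothetical helper)."""
    for parser in (parse_number, parse_tuple):
        parsed_expected = parser(expected)
        if parsed_expected is not None:
            # The expected answer has a known type: the prediction must
            # parse as the same type and compare equal exactly.
            parsed_predicted = parser(predicted)
            return parsed_predicted is not None and parsed_expected == parsed_predicted
    # Unrecognized type: fall back to a normalized string match.
    return normalize(expected) == normalize(predicted)


print(is_equivalent("1/2", "0.5"))
print(is_equivalent("(1, 2)", "(2, 1)"))
```

Dispatching on the type of the expected answer means a prediction such as `0.5` is compared to `1/2` by value rather than by spelling, which is the kind of robustness that dataset-specific string-matching scripts tend to lack.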