MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Boning Zhang, Chengxi Li, Kai Fan

2024-04-22

Abstract: Large language models (LLMs) have been explored in a variety of reasoning tasks, including solving mathematical problems. Each math dataset typically includes its own specially designed evaluation script, which, while suitable for its intended use, lacks generalizability across different datasets. Consequently, updates and adaptations to these evaluation tools tend to occur without being systematically reported, leading to inconsistencies and obstacles to fair comparison across studies. To bridge this gap, we introduce a comprehensive mathematical evaluation toolkit that not only utilizes a Python computer algebra system (CAS) for its numerical accuracy, but also integrates an optional LLM, known for its considerable natural language processing capabilities. To validate the effectiveness of our toolkit, we manually annotated two distinct datasets. Our experiments demonstrate that the toolkit yields more robust evaluation results than prior works, even without an LLM. Furthermore, when an LLM is incorporated, there is a notable enhancement. The code for our method will be made available at \url{

Computation and Language

What problem does this paper attempt to address?

The paper aims to address the lack of standardization and consistency in automatic evaluation tools for mathematical reasoning tasks. Specifically:

1. **Problems with existing evaluation tools**: Most current mathematical datasets come with specially designed evaluation scripts that, while suitable for their own datasets, lack generality. As a result, it is difficult to fairly compare evaluation results across different studies.
2. **Proposed solution**: To fill this gap, the authors developed a comprehensive mathematical evaluation toolkit (MARIO Eval), which not only utilizes a Python computer algebra system (CAS) to ensure numerical accuracy but also optionally integrates a large language model (LLM) to enhance natural language processing capabilities. The toolkit can recognize various types of mathematical answers and evaluate the equivalence between expected and predicted answers through type-specific functions.
3. **Experimental validation**: By manually annotating two different datasets, the authors demonstrated that even without the LLM, the toolkit provides more robust evaluation results than previous methods; when the LLM is included, evaluation accuracy improves further.

In summary, the goal of this paper is to establish a convenient and standardized evaluation framework to support future research in the field of mathematical reasoning.
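The idea of recognizing an answer's type and then dispatching to a type-specific equivalence function can be illustrated with a minimal sketch. This is not the toolkit's code: the function names (`parse_number`, `classify`, `is_equiv`), the three answer types, and the surface-form heuristics are all hypothetical, and exact rational arithmetic from the standard library stands in for the CAS the paper relies on.

```python
from fractions import Fraction


def parse_number(text):
    """Parse '0.5', '1/2', or '50%' into an exact Fraction; return None on failure."""
    cleaned = text.strip()
    scale = Fraction(1)
    if cleaned.endswith("%"):
        cleaned = cleaned[:-1]
        scale = Fraction(1, 100)
    try:
        return Fraction(cleaned) * scale
    except (ValueError, ZeroDivisionError):
        return None


def classify(text):
    """Guess a coarse answer type from its surface form (hypothetical heuristic)."""
    stripped = text.strip()
    if stripped.startswith("(") and stripped.endswith(")") and "," in stripped:
        return "tuple"
    if parse_number(stripped) is not None:
        return "number"
    return "text"


def is_equiv(expected, predicted):
    """Compare two answer strings with a comparator chosen by answer type."""
    kind = classify(expected)
    if kind != classify(predicted):
        return False
    if kind == "number":
        # Exact rational comparison; a CAS would also handle symbolic forms.
        return parse_number(expected) == parse_number(predicted)
    if kind == "tuple":
        # Flat tuples only: split on commas and compare element by element.
        exp_items = expected.strip()[1:-1].split(",")
        pred_items = predicted.strip()[1:-1].split(",")
        return len(exp_items) == len(pred_items) and all(
            is_equiv(e, p) for e, p in zip(exp_items, pred_items)
        )
    # Fallback: case-insensitive string match for anything unrecognized.
    return expected.strip().lower() == predicted.strip().lower()
```

With these assumptions, `is_equiv("1/2", "0.5")` and `is_equiv("(1, 2)", "(1.0, 4/2)")` both hold, while `is_equiv("1/3", "0.333")` does not. The string fallback is where a symbolic comparison or an optional LLM judge would plausibly plug in; nested tuples and symbolic expressions are outside what this sketch handles.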
