Abstract: Existing math datasets evaluate the reasoning abilities of large language models (LLMs) using either the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and wrong reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. The executable programs that are verified against the original input-output pairs are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions using alternative input-output pairs based on the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under our proposed evaluation compared with the original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.

What problem does this paper attempt to address?

The paper addresses deficiencies in existing methods for evaluating the mathematical reasoning ability of large language models (LLMs). Existing methods check either the final answer or the intermediate reasoning steps derived from static examples. Checking only the final answer cannot reveal whether a model relied on shortcuts or incorrect reasoning, while checking intermediate steps makes it hard to accommodate alternative solutions. The paper proposes a new evaluation method, ReasonAgain. It uses symbolic programs (i.e., Python programs) to generate multiple input-output pairs, so that a model's reasoning ability can be evaluated more comprehensively across different inputs. This method can detect whether the model relies on shortcuts or incorrect reasoning while still accommodating different solutions, improving the reliability and accuracy of the evaluation.

### Main Contributions

1. **Propose the ReasonAgain method**: Generate symbolic programs to evaluate the model's reasoning ability under different inputs, checking that the model consistently produces the correct final answers.
2. **Generate diverse test cases**: Use the extracted programs to generate new input-output pairs and test the model's performance under different inputs.
3. **Reveal the vulnerabilities of existing models**: The experimental results show that existing state-of-the-art LLMs suffer a significant performance decline on the newly generated test cases, indicating that their mathematical reasoning is fragile.

### Method Overview

1. **Generate symbolic programs**: Use GPT-4o to generate Python programs that solve the original mathematical problems.
2. **Verify programs**: Check that each generated program reproduces the original input-output pair.
3. **Generate new test cases**: Generate new input-output pairs from the extracted programs and test the model's performance on these new cases.
4. **Evaluate the model**: Compare the model's performance on the newly generated test cases with its performance on the original dataset to evaluate the robustness of its reasoning ability.

### Experimental Results

- **Performance decline**: On the GSM8K and MATH datasets, all models show a significant performance decline on the newly generated test cases, indicating the vulnerability of existing models in mathematical reasoning.
- **Normalized accuracy**: Even for questions that the model initially answered correctly, the newly generated test cases expose deficiencies. Only 50% to 80% of the new problems are answered correctly.
- **Proportion of true understanding**: The proportion of questions that a model truly understands is at most 50%, and sometimes below 30%.

### Conclusion

ReasonAgain provides a more effective evaluation method that reflects the actual mathematical reasoning ability of large language models more realistically and reveals the deficiencies of existing evaluation methods. Future work can further improve the generated programs to increase the coverage and accuracy of the evaluation.
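
The four steps in the Method Overview can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the question, the `solution` program, the sampling range in `make_cases`, and the toy models are all hypothetical. The step in which GPT-4o rewrites each new input-output pair into a text question is omitted, so the "models" here receive the input values directly. Treating a question as consistently solved only when the original and every perturbed case are correct is likewise an assumed reading of "true understanding".

```python
import random
from typing import Callable, Dict, List, Tuple

Inputs = Dict[str, int]
Case = Tuple[Inputs, int]

# Hypothetical question: "A baker packs `boxes` boxes with `per_box` cookies
# each, then gives away `gifted` cookies. How many cookies are left?"
ORIGINAL_INPUTS: Inputs = {"boxes": 4, "per_box": 12, "gifted": 5}
ORIGINAL_ANSWER = 43


def solution(boxes: int, per_box: int, gifted: int) -> int:
    """Step 1 stand-in: a symbolic program an LLM might extract."""
    return boxes * per_box - gifted


def verify(program: Callable[..., int], inputs: Inputs, expected: int) -> bool:
    """Step 2: accept the program only if it reproduces the original answer."""
    try:
        return program(**inputs) == expected
    except Exception:
        return False


def make_cases(program: Callable[..., int], inputs: Inputs, n: int,
               rng: random.Random, max_tries: int = 1000) -> List[Case]:
    """Step 3: sample new inputs (assumed range: 1 to twice the original value).

    Cases whose answer equals the original answer are skipped so that a
    memorized answer cannot pass by coincidence.
    """
    original_answer = program(**inputs)
    cases: List[Case] = []
    for _ in range(max_tries):
        if len(cases) == n:
            break
        candidate = {name: rng.randint(1, 2 * value)
                     for name, value in inputs.items()}
        answer = program(**candidate)
        if answer > 0 and answer != original_answer:
            cases.append((candidate, answer))
    return cases


def evaluate(model: Callable[[Inputs], int], program: Callable[..., int],
             inputs: Inputs, expected: int, n: int = 5,
             seed: int = 0) -> Dict[str, object]:
    """Step 4: compare accuracy on the original case and the new cases."""
    if not verify(program, inputs, expected):
        raise ValueError("program does not reproduce the original answer")
    cases = make_cases(program, inputs, n, random.Random(seed))
    if not cases:
        raise ValueError("no perturbed cases could be generated")
    original_ok = model(inputs) == expected
    new_ok = [model(x) == y for x, y in cases]
    return {
        "original_correct": original_ok,
        "perturbed_accuracy": sum(new_ok) / len(new_ok),
        "consistent": original_ok and all(new_ok),
    }


def shortcut_model(inputs: Inputs) -> int:
    """Toy model that memorized the original answer."""
    return ORIGINAL_ANSWER


def robust_model(inputs: Inputs) -> int:
    """Toy model that actually performs the computation."""
    return solution(**inputs)


print(evaluate(shortcut_model, solution, ORIGINAL_INPUTS, ORIGINAL_ANSWER))
print(evaluate(robust_model, solution, ORIGINAL_INPUTS, ORIGINAL_ANSWER))
```

In this sketch both toy models answer the original question correctly, so a static benchmark could not tell them apart. Only the perturbed cases separate them: the memorizing model fails every new case by construction, while the computing model passes all of them.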

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning
