ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

Xiaodong Yu, Ben Zhou, Hao Cheng, Dan Roth
2024-10-25
Abstract: Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by using either the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and wrong reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we use symbolic programs to automatically evaluate whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT-4o. Executable programs that are verified against the original input-output pairs are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT-4o to generate new questions using alternative input-output pairs based on the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops under our proposed evaluation compared with the original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses shortcomings in how the mathematical reasoning ability of large language models (LLMs) is evaluated. Existing methods score either the final answer or the intermediate reasoning steps derived from static examples. The first cannot reveal when a model reaches a correct answer through shortcuts or wrong reasoning, and the second has difficulty accommodating alternative solutions. The paper proposes a new evaluation method, ReasonAgain, which uses symbolic programs (i.e., Python programs) to generate multiple input-output pairs for a question and checks whether the model answers correctly across all of them. This detects reliance on shortcuts or incorrect reasoning while still accepting different solution paths, which makes the evaluation more reliable.

### Main Contributions

1. **Propose the ReasonAgain method**: Use symbolic programs to evaluate a model's reasoning under different inputs, checking that the model consistently produces the correct final answer.
2. **Generate diverse test cases**: Use the extracted programs to generate new input-output pairs and test the model's performance on them.
3. **Reveal the vulnerabilities of existing models**: The experimental results show that state-of-the-art LLMs suffer a significant performance decline on the newly generated test cases, indicating that their mathematical reasoning is fragile.

### Method Overview

1. **Generate symbolic programs**: Use GPT-4o to generate Python programs that solve the original math problems.
2. **Verify programs**: Check that each generated program reproduces the original input-output pair.
3. **Generate new test cases**: Generate new input-output pairs from the verified programs, and prompt GPT-4o to write new questions based on them.
4. **Evaluate the model**: Compare the model's performance on the newly generated test cases with its performance on the original dataset to assess the robustness of its reasoning.

### Experimental Results

- **Performance decline**: On the GSM8K and MATH datasets, all models show a significant performance decline on the newly generated test cases, indicating the vulnerability of existing models in mathematical reasoning.
- **Normalized accuracy**: Even for questions that a model initially answered correctly, the newly generated test cases expose deficiencies. Only 50% to 80% of the new problems are answered correctly.
- **Proportion of true understanding**: The proportion of questions that a model truly understands is at most 50%, and sometimes less than 30%.

### Conclusion

ReasonAgain provides a more effective evaluation method that reflects the actual mathematical reasoning ability of large language models more realistically and exposes the deficiencies of existing evaluation methods. Future work can further improve the generated programs to increase the coverage and accuracy of the evaluation.
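
The four steps in the Method Overview can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy cookie question, the names `program`, `verify`, `perturb`, and `consistency`, the sampling range, and the filter on new answers are all hypothetical. The two "models" are plain functions standing in for an LLM, and they receive the numeric inputs directly, whereas the paper has GPT-4o rewrite each new input-output pair into a text question.

```python
import random

# Hypothetical original question: "A baker has 5 boxes of 10 cookies and
# eats 7 cookies. How many cookies are left?" (gold answer: 43)
ORIGINAL_INPUTS = {"n_boxes": 5, "per_box": 10, "eaten": 7}
ORIGINAL_ANSWER = 43


def program(n_boxes: int, per_box: int, eaten: int) -> int:
    """Step 1: stand-in for a symbolic program extracted by an LLM."""
    return n_boxes * per_box - eaten


def verify(prog, inputs: dict, answer: int) -> bool:
    """Step 2: keep a program only if it reproduces the original answer."""
    return prog(**inputs) == answer


def perturb(prog, inputs: dict, answer: int, rng: random.Random,
            n_variants: int = 5) -> list:
    """Step 3: sample new inputs; the program supplies the new gold answers."""
    variants = []
    while len(variants) < n_variants:
        candidate = {name: rng.randint(1, 20) for name in inputs}
        gold = prog(**candidate)
        # Assumed filter: the answer must stay non-negative and differ from
        # the original, so a memorized answer cannot pass by accident.
        if gold >= 0 and gold != answer and candidate not in variants:
            variants.append(candidate)
    return variants


def consistency(model, prog, variants: list) -> float:
    """Step 4: fraction of perturbed inputs the model answers correctly."""
    correct = sum(model(**v) == prog(**v) for v in variants)
    return correct / len(variants)


def sound_model(n_boxes: int, per_box: int, eaten: int) -> int:
    """Stand-in for a model that actually performs the reasoning."""
    return n_boxes * per_box - eaten


def shortcut_model(n_boxes: int, per_box: int, eaten: int) -> int:
    """Stand-in for a model that memorized the original answer."""
    return ORIGINAL_ANSWER


if __name__ == "__main__":
    rng = random.Random(0)
    assert verify(program, ORIGINAL_INPUTS, ORIGINAL_ANSWER)
    variants = perturb(program, ORIGINAL_INPUTS, ORIGINAL_ANSWER, rng)
    print(consistency(sound_model, program, variants))     # 1.0
    print(consistency(shortcut_model, program, variants))  # 0.0
```

Both stand-in models answer the original question correctly, so a static benchmark cannot tell them apart. Only the perturbed inputs separate the model that reasons from the one that memorized the answer, which is the kind of gap the reported accuracy drops point to.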