LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Arash Gholami Davoodi, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

2024-06-08

Abstract: Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are genuinely engaging in reasoning. To address these gaps, we present the Mathematical Topics Tree (MaTT) benchmark, a challenging and structured benchmark that offers 1,958 questions across a wide array of mathematical subjects, each paired with a detailed hierarchical chain of topics. Upon assessing different LLMs using the MaTT benchmark, we find that the most advanced model, GPT-4, achieved a mere 54% accuracy in a multiple-choice scenario. Interestingly, even when employing Chain-of-Thought prompting, we mostly observe no notable improvement. Moreover, LLMs' accuracy dropped dramatically, by up to 24.2 percentage points, when the questions were presented without providing choices. Further detailed analysis of the LLMs' performance across a range of topics showed significant discrepancies even for closely related subtopics within the same general mathematical area. In an effort to pinpoint the reasons behind LLMs' performance, we conducted a manual evaluation of the completeness and correctness of the explanations generated by GPT-4 when choices were available. Surprisingly, we find that in only 53.3% of the instances where the model provided a correct answer, the accompanying explanations were deemed complete and accurate, i.e., the model engaged in genuine reasoning.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the limitations of current evaluations of the mathematical reasoning abilities of large language models (LLMs). Although LLMs perform well on mathematical problems, existing evaluations focus mainly on specific mathematical domains, and it is difficult to determine whether the models are truly engaging in reasoning. To fill this gap, the paper proposes a challenging, structured benchmark called the Mathematical Topics Tree (MaTT), which consists of 1,958 questions covering a wide range of mathematical topics, each accompanied by a detailed hierarchical chain of topics.

The researchers evaluate different LLMs on the MaTT benchmark and find that the most advanced model, GPT-4, achieves only 54% accuracy in the multiple-choice setting. Chain-of-Thought prompting mostly yields no notable improvement. When the answer choices are not provided, the accuracy of LLMs drops by up to 24.2 percentage points. The analysis also shows significant differences in LLM performance across mathematical topics, even between closely related subtopics of the same mathematical area.

A manual evaluation further indicates that, among the questions GPT-4 answered correctly with choices available, only 53.3% of the accompanying explanations were deemed complete and accurate, that is, reflective of genuine reasoning. The research also suggests that, on questions that are complex or require creative thinking, LLMs may rely on strategies such as choice engineering, unverified theorem usage, circular reasoning, or blind memorization rather than genuine mathematical reasoning. Through the MaTT benchmark, the paper provides a comprehensive mathematical evaluation framework intended to support a deeper understanding of LLMs' reasoning abilities and to uncover subtle differences in their strengths, weaknesses, and problem-solving strategies.
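To make the two quantities discussed above concrete (accuracy broken down along a hierarchical chain of topics, and a drop measured in percentage points), the following is a minimal Python sketch. It is not the authors' evaluation code: the `Question` record, its field names, the `accuracy_by_prefix` helper, and the toy entries are all hypothetical and exist only to illustrate the arithmetic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record layout; the benchmark's real schema may differ.
    topic_chain: tuple[str, ...]  # root-to-leaf path, e.g. ("Algebra", "Group theory")
    correct_with_choices: bool
    correct_without_choices: bool


def accuracy_by_prefix(questions, attr):
    """Return accuracy in percent for every prefix of every topic chain."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in questions:
        for depth in range(1, len(q.topic_chain) + 1):
            prefix = q.topic_chain[:depth]
            totals[prefix] += 1
            hits[prefix] += int(getattr(q, attr))
    return {p: 100.0 * hits[p] / totals[p] for p in totals}


# Toy, made-up outcomes used only to exercise the code (not benchmark data).
toy = [
    Question(("Algebra", "Group theory"), True, True),
    Question(("Algebra", "Group theory"), True, False),
    Question(("Algebra", "Ring theory"), False, False),
    Question(("Algebra", "Ring theory"), True, False),
]

with_choices = accuracy_by_prefix(toy, "correct_with_choices")
without_choices = accuracy_by_prefix(toy, "correct_without_choices")

# A drop in percentage points is a difference of two percentages, not a ratio.
drop = {p: with_choices[p] - without_choices[p] for p in with_choices}

for prefix in sorted(drop):
    print(
        " > ".join(prefix),
        f"{with_choices[prefix]:.1f}% -> {without_choices[prefix]:.1f}%",
        f"(drop of {drop[prefix]:.1f} points)",
    )
```

Aggregating over every prefix of the topic chain gives one accuracy figure per node of the tree, which is what allows sibling subtopics under the same parent area to be compared directly.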