Abstract: Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset \textsc{MathTrap} by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8K. Since problems with logical flaws are quite rare in the real world, these represent "unseen" cases to LLMs. Solving these requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of requisite knowledge, they do not \textbf{spontaneously} combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. Additionally, we test the recently released OpenAI o1 model and find that human-like `slow thinking' helps improve the compositionality of LLMs. Overall, systematic compositionality remains an open challenge for large language models.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper explores the compositional generalization ability of large language models (LLMs) in mathematical reasoning. Specifically, the authors construct a new dataset, **MathTrap**, by introducing carefully designed logical traps into existing math problems. The dataset tests whether models can systematically combine the mathematical knowledge needed for the original problems with the knowledge related to the introduced traps.

### Background and Motivation

A key feature of human cognition is systematic compositionality, the ability to generate infinite new combinations from a finite set of learned components. This ability is crucial for understanding and handling complex logic. Although current large language models have achieved significant success on tasks requiring complex reasoning, whether they possess systematic compositionality remains an open question.

### Methods

1. **Constructing the MathTrap Dataset**:
   - The authors extract original problems from the MATH and GSM8K datasets and introduce logical traps into the problem descriptions.
   - For example, the original problem "Solve the equation \(x^2 + x = 3\)" becomes "Solve the equation \(x^2 + x = 3\) for integer solutions." The model must not only solve a quadratic equation but also apply the concept of integers: the discriminant is 13, which is not a perfect square, so the equation has no integer solutions.
2. **Experimental Setup**:
   - The authors conducted comprehensive tests on multiple leading large language models and recruited 43 undergraduates from top universities as a human control group.
   - Evaluation metrics include accuracy and the change in performance after traps are introduced.
3. **Intervention Methods**:
   - To mitigate the shortcomings of LLMs on trap problems, the authors explored several external interventions: natural language prompts, few-shot demonstrations, and fine-tuning.

### Main Findings

1. **Knowledge Possession of LLMs**:
   - Experimental results show that although LLMs possess the individual knowledge components needed to solve trap problems, they do not spontaneously combine this knowledge to handle new situations.
2. **Behavioral Differences Between Humans and LLMs**:
   - Humans show a clear advantage on trap problems, flexibly applying existing knowledge. In contrast, LLMs' performance declines significantly when faced with trap problems.
3. **Effectiveness of Intervention Methods**:
   - Natural language prompts and few-shot demonstrations improve LLMs' performance on trap problems to some extent.
   - Fine-tuning can significantly improve performance on trap problems but may reduce accuracy on the original problems.

### Conclusion

This study shows that large language models still face significant challenges in compositional generalization in mathematical reasoning. Although external interventions can alleviate the issue to some extent, LLMs still lag behind humans in systematic compositionality. Future research can explore how to automatically generate high-quality trap problems to better evaluate and improve the compositional generalization ability of LLMs.
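
To make the trap example from the Methods section concrete, the following is a minimal Python sketch, not the authors' evaluation code. It checks whether a quadratic has integer roots and computes a simple accuracy drop between original and trap problems. The helper names (`integer_roots`, `accuracy`), the control equation \(x^2 + x = 6\), and the graded-answer lists are all hypothetical and invented for illustration; they are not results from the paper.

```python
import math


def integer_roots(a: int, b: int, c: int) -> list[int]:
    """Return the integer roots of a*x^2 + b*x + c = 0 (assumes a != 0)."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots at all
    root = math.isqrt(disc)
    if root * root != disc:
        return []  # discriminant is not a perfect square: roots are irrational
    numerators = {-b + root, -b - root}
    # Keep only the roots where the division by 2a is exact.
    return sorted(n // (2 * a) for n in numerators if n % (2 * a) == 0)


def accuracy(graded: list[bool]) -> float:
    """Fraction of answers graded as correct."""
    return sum(graded) / len(graded)


# Trap problem from the text: x^2 + x = 3, i.e. x^2 + x - 3 = 0.
trap_roots = integer_roots(1, 1, -3)

# Hypothetical control problem (not from the paper): x^2 + x = 6.
control_roots = integer_roots(1, 1, -6)

# Made-up grading results, used only to illustrate the metric.
original_graded = [True, True, True, False]
trap_graded = [False, True, False, False]
drop = accuracy(original_graded) - accuracy(trap_graded)

print(trap_roots, control_roots, drop)
```

For the trap problem the function returns an empty list, which is the point of the trap: a correct response has to state that no integer solution exists rather than report the irrational roots.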

Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning
