Exploring the Compositional Deficiency of Large Language Models in Mathematical Reasoning

Jun Zhao, Jingqi Tong, Yurong Mou, Ming Zhang, Qi Zhang, Xuanjing Huang
2024-10-10
Abstract: Human cognition exhibits systematic compositionality, the algebraic ability to generate infinite novel combinations from finite learned components, which is the key to understanding and reasoning about complex logic. In this work, we investigate the compositionality of large language models (LLMs) in mathematical reasoning. Specifically, we construct a new dataset, MathTrap, by introducing carefully designed logical traps into the problem descriptions of MATH and GSM8K. Since problems with logical flaws are quite rare in the real world, these represent "unseen" cases to LLMs. Solving these requires the models to systematically compose (1) the mathematical knowledge involved in the original problems with (2) knowledge related to the introduced traps. Our experiments show that while LLMs possess both components of requisite knowledge, they do not **spontaneously** combine them to handle these novel cases. We explore several methods to mitigate this deficiency, such as natural language prompts, few-shot demonstrations, and fine-tuning. Additionally, we test the recently released OpenAI o1 model and find that human-like "slow thinking" helps improve the compositionality of LLMs. Overall, systematic compositionality remains an open challenge for large language models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper investigates the compositional generalization ability of large language models (LLMs) in mathematical reasoning. Specifically, the authors construct a new dataset, **MathTrap**, by introducing carefully designed logical traps into existing math problems. The dataset tests whether models can systematically combine the mathematical knowledge required by the original problems with the knowledge related to the introduced traps.

### Background and Motivation

A key feature of human cognition is systematic compositionality, the ability to generate infinite new combinations from a finite set of learned components. This ability is crucial for understanding and handling complex logic. Although current large language models have achieved significant success on tasks requiring complex reasoning, whether they possess systematic compositionality remains an open question.

### Methods

1. **Constructing the MathTrap Dataset**:
   - The authors extract original problems from the MATH and GSM8K datasets and introduce logical traps into the problem descriptions.
   - For example, the original problem "Solve the equation \(x^2 + x = 3\)" becomes "Solve the equation \(x^2 + x = 3\) for integer solutions." Answering correctly requires the model not only to solve a quadratic equation but also to apply the concept of integers and recognize that the equation has no integer solution, since its roots are \((-1 \pm \sqrt{13})/2\).
2. **Experimental Setup**:
   - The authors conducted comprehensive tests on multiple leading large language models and recruited 43 undergraduates from top universities as a human control group.
   - Evaluation metrics include accuracy and the change in performance when traps are introduced.
3. **Intervention Methods**:
   - To mitigate the shortcomings of LLMs in handling trap problems, the authors explored several external interventions: natural language prompts, few-shot demonstrations, and fine-tuning.

### Main Findings

1. **Knowledge Possession of LLMs**:
   - Experimental results show that although LLMs possess the individual knowledge components needed to solve trap problems, they do not spontaneously combine this knowledge to handle new situations.
2. **Behavioral Differences Between Humans and LLMs**:
   - Humans show a clear advantage in handling trap problems, flexibly applying existing knowledge. In contrast, LLMs' performance declines significantly when faced with trap problems.
3. **Effectiveness of Intervention Methods**:
   - Natural language prompts and few-shot demonstrations improve LLMs' performance on trap problems to some extent.
   - Fine-tuning can significantly enhance the model's performance on trap problems but may reduce its accuracy on the original problems.

### Conclusion

This study shows that large language models still face significant challenges in compositional generalization in mathematical reasoning. Although external interventions can alleviate this issue to some extent, LLMs still lag behind humans in systematic compositionality. Future research can explore how to automatically generate high-quality trap problems to better evaluate and improve the compositional generalization ability of LLMs.
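
The trap example from the Methods section, and the idea of measuring how accuracy changes once traps are introduced, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `integer_roots`, `accuracy`, and `trap_report` are hypothetical helpers, the correctness flags are made-up placeholders rather than results from the paper, and the "drop" value is just one plausible way to express the performance change, not necessarily the metric the authors use.

```python
import math


def integer_roots(a: int, b: int, c: int) -> list[int]:
    """Return the integer roots of a*x^2 + b*x + c = 0 (a must be non-zero)."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots at all
    sqrt_disc = math.isqrt(disc)
    if sqrt_disc * sqrt_disc != disc:
        return []  # discriminant is not a perfect square: roots are irrational
    roots = set()
    for sign in (1, -1):
        numerator = -b + sign * sqrt_disc
        if numerator % (2 * a) == 0:  # keep only roots that divide evenly
            roots.add(numerator // (2 * a))
    return sorted(roots)


def accuracy(flags: list[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(flags) / len(flags)


def trap_report(original_correct: list[bool], trap_correct: list[bool]) -> dict[str, float]:
    """Compare accuracy on original problems with accuracy on their trap variants."""
    acc_original = accuracy(original_correct)
    acc_trap = accuracy(trap_correct)
    return {
        "original": acc_original,
        "trap": acc_trap,
        "drop": acc_original - acc_trap,  # assumed measure of the performance change
    }


if __name__ == "__main__":
    # Trap problem: x^2 + x = 3, i.e. x^2 + x - 3 = 0, restricted to integers.
    print(integer_roots(1, 1, -3))  # [] because the discriminant is 13
    # A variant without a trap: x^2 + x = 6 does have integer solutions.
    print(integer_roots(1, 1, -6))  # [-3, 2]

    # Made-up per-problem correctness flags, NOT data from the paper.
    report = trap_report([True, True, True, False], [True, False, False, False])
    print(report)
```

A model that falls into the trap would report the irrational roots \((-1 \pm \sqrt{13})/2\) as the answer; the expected response is that no integer solution exists, which is what the empty list represents.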