On Memorization of Large Language Models in Logical Reasoning

Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, Ravi Kumar
2024-10-30
Abstract: Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We found that LLMs could interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on K&K puzzles despite training data memorization. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with per-sample memorization score sheds light on how LLMs switch between reasoning and memorization in solving logical puzzles. Our code and data are available at https://memkklogic.github.io.
Computation and Language
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) rely on memorization when solving logical reasoning tasks, and whether that memorization is harmful to learning to reason. Specifically, it addresses two questions through quantitative analysis:

1. **Do LLMs rely on memorization to solve reasoning tasks?** The paper proposes a local inconsistency-based memorization score (LiMem) to measure how much a model memorizes when solving logical reasoning tasks. The score combines the model's accuracy (Acc) on the original problems with its consistency ratio (CR) on locally perturbed versions of those problems: \( \text{LiMem}(f;D)=\text{Acc}(f;D)\cdot(1 - \text{CR}(f;D)) \). The score is therefore high when the model is accurate on the original problems but inconsistent once they are perturbed.
2. **Is memorization only harmful to learning to reason?** The paper explores the complex relationship between memorization and reasoning ability, in particular whether an increase in memorization after fine-tuning is accompanied by an improvement in reasoning performance.

To study these questions systematically, the paper uses a dynamically generated logical reasoning benchmark based on "Knights and Knaves" (K&K) puzzles. With this benchmark, the authors can generate new puzzles and apply local perturbations to existing ones, and thereby evaluate both the memorization and the reasoning abilities of a model. The results show that although the models memorize the training set heavily, they do develop genuine reasoning abilities during fine-tuning, and reasoning performance improves as the level of memorization increases.
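The LiMem formula above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `PuzzleResult`, `accuracy`, `consistency_ratio`, and `limem` are hypothetical, and the sketch assumes that CR is the fraction of originally solved puzzles that the model still solves after a local perturbation (the summary above does not spell out this detail).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PuzzleResult:
    """Outcome for one puzzle (hypothetical record type, for illustration only)."""
    solved_original: bool   # model answered the original puzzle correctly
    solved_perturbed: bool  # model answered the locally perturbed puzzle correctly


def accuracy(results: Sequence[PuzzleResult]) -> float:
    """Acc(f; D): fraction of original puzzles solved correctly."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(r.solved_original for r in results) / len(results)


def consistency_ratio(results: Sequence[PuzzleResult]) -> float:
    """CR(f; D), under the assumption stated in the lead-in: among puzzles
    solved in their original form, the fraction still solved after perturbation."""
    solved = [r for r in results if r.solved_original]
    if not solved:
        # Acc is 0 in this case, so LiMem is 0 whatever value is returned here.
        return 0.0
    return sum(r.solved_perturbed for r in solved) / len(solved)


def limem(results: Sequence[PuzzleResult]) -> float:
    """LiMem(f; D) = Acc(f; D) * (1 - CR(f; D))."""
    return accuracy(results) * (1.0 - consistency_ratio(results))


if __name__ == "__main__":
    # Made-up outcomes: 9 of 10 puzzles solved, only 3 of those 9 survive perturbation.
    demo = (
        [PuzzleResult(True, True)] * 3
        + [PuzzleResult(True, False)] * 6
        + [PuzzleResult(False, False)]
    )
    print(f"Acc={accuracy(demo):.3f} CR={consistency_ratio(demo):.3f} LiMem={limem(demo):.3f}")
```

With these made-up outcomes, Acc is 0.9 and CR is 1/3, so LiMem is about 0.6. A model that solves every original puzzle and every perturbed one scores 0, while one that solves every original puzzle and no perturbed one scores 1, which matches the intended reading of the score as "accurate but locally inconsistent".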