Abstract: Reasoning is a key ability for an intelligent system. Large language models (LMs) achieve above-chance performance on abstract reasoning tasks, but exhibit many imperfections. However, human abstract reasoning is also imperfect. For example, human reasoning is affected by our real-world knowledge and beliefs, and shows notable "content effects"; humans reason more reliably when the semantic content of a problem supports the correct logical inferences. These content-entangled reasoning patterns play a central role in debates about the fundamental nature of human intelligence. Here, we investigate whether language models, whose prior expectations capture some aspects of human knowledge, similarly mix content into their answers to logical problems. We explored this question across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task. We evaluate state-of-the-art large language models, as well as humans, and find that the language models reflect many of the same patterns observed in humans across these tasks: like humans, models answer more accurately when the semantic content of a task supports the logical inferences. These parallels are reflected both in answer patterns, and in lower-level features like the relationship between model answer distributions and human response times. Our findings have implications for understanding both these cognitive effects in humans, and the factors that contribute to language model performance.

What problem does this paper attempt to address?

This paper explores whether large language models exhibit "content effects" similar to those of humans in logical reasoning tasks. Specifically, the researchers ask whether model performance changes depending on whether the semantic content of a logical reasoning problem supports or conflicts with the correct logical inference. In humans, the content effect means that reasoning accuracy is influenced by background knowledge and beliefs about the problem: people usually reason more accurately about familiar, believable, or realistic situations, and perform worse on unfamiliar, unbelievable, or abstract problems.

To test this hypothesis, the researchers evaluated language models and humans on three logical reasoning tasks:

1. **Natural Language Inference (NLI)**: Participants need to determine whether a hypothesis logically follows from a premise.
2. **Syllogisms**: Participants need to determine whether a given syllogism is logically valid.
3. **Wason Selection Task**: Participants are given a rule and a set of cards, and need to select which cards must be flipped to check whether the rule holds.

Across these tasks, the researchers found that both humans and language models generally perform better on logical problems whose content is consistent with real-world knowledge, and worse on problems that conflict with real-world knowledge or are completely abstract. This indicates that language models are also affected by the content effect to a certain extent, in a way that resembles human reasoning patterns. The finding helps in understanding the reasoning behavior of language models, and it also offers a new perspective on the balance between abstract and concrete abilities in human cognition.
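To make the logic behind the Wason selection task concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `cards_to_flip`, the tuple encoding of cards, and the vowel/even-number rule are all illustrative assumptions. For a rule of the form "if P then Q", the only cards worth flipping are those whose hidden side could falsify the rule, namely a card showing P and a card showing not-Q.

```python
def cards_to_flip(cards):
    """Return labels of cards whose hidden side could falsify 'if P then Q'.

    Each card is a hypothetical (label, side, value) tuple:
      side  -- "P" if the visible face concerns the antecedent, "Q" if the consequent
      value -- True if the visible face satisfies that part of the rule
    """
    chosen = []
    for label, side, value in cards:
        if side == "P" and value:
            # P holds on the visible face: the hidden Q side might be false.
            chosen.append(label)
        elif side == "Q" and not value:
            # Q fails on the visible face: the hidden P side might be true.
            chosen.append(label)
    return chosen


# Illustrative abstract rule (an assumption, not the paper's stimuli):
# "If a card shows a vowel on one side, it shows an even number on the other."
cards = [
    ("A", "P", True),   # vowel: antecedent holds
    ("K", "P", False),  # consonant: antecedent fails
    ("4", "Q", True),   # even number: consequent holds
    ("7", "Q", False),  # odd number: consequent fails
]

print(cards_to_flip(cards))  # ['A', '7']
```

Flipping the card that shows the consequent holding ("4" here) is a common mistake: no hidden face on that card can violate the rule, so it is not informative.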

Language models show human-like content effects on reasoning tasks

Language models, like humans, show content effects on reasoning tasks

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Language models and psychological sciences

A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Large Language Models Are Not Strong Abstract Reasoners

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Studying and improving reasoning in humans and machines

Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks

Evaluating the Deductive Competence of Large Language Models

Case Study: Testing Model Capabilities in Some Reasoning Tasks

(Ir)rationality and Cognitive Biases in Large Language Models

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Towards Reasoning in Large Language Models: A Survey

Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

Can Large Language Models Reason? A Characterization via 3-SAT