Abstract: Do large language models (LLMs) display rational reasoning? LLMs have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. In this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality is displayed does not reflect that shown by humans. When incorrect answers are given by LLMs to these tasks, they are often incorrect in ways that differ from human-like biases. On top of this, the LLMs reveal an additional layer of irrationality in the significant inconsistency of the responses. Aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.

### What problem does this paper attempt to solve?

This paper explores whether large language models (LLMs) exhibit rational reasoning. Specifically, the authors evaluate seven LLMs on tasks from cognitive psychology to answer the following questions:

1. **Do LLMs exhibit irrational behavior like humans?**
   - The paper finds that LLMs do exhibit irrationality in these tasks, but that it differs from human irrationality. When LLMs give wrong answers, the errors often do not match human cognitive biases.
2. **Is the irrational performance of LLMs consistent?**
   - The study finds significant inconsistency in LLM responses to the same task, which points to an additional layer of irrationality: inconsistent results.
3. **How can the rational reasoning abilities of different LLMs be evaluated and compared?**
   - Beyond the experimental results, the paper makes a methodological contribution by showing how to evaluate and compare the rational reasoning abilities of these types of models. The authors use tasks designed by Kahneman and Tversky et al., which were originally designed to reveal human cognitive biases.

### Specific research content

- **Models evaluated**: Seven LLMs: GPT-3.5, GPT-4, Bard, Claude 2, and three versions of Llama 2 (7B, 13B, and 70B parameters).
- **Evaluation tasks**: A total of 12 tasks based on classic tasks from the cognitive psychology literature, such as the Wason task, the AIDS task, the hospital problem, and the Monty Hall problem.
- **Evaluation dimensions**: Each response is classified along two dimensions: correctness (Correct) and human-likeness (Human-like). See Table 2 for the specific classification:
  - Correct and reasoned
  - Correct but illogical reasoning
  - Incorrect and human-like
  - Incorrect and non-human-like

### Main findings

- **Performance differences**: GPT-4 performs best, giving correct and logical answers in 69.2% of cases, while the Llama 2 7B model performs worst, giving wrong answers in 77.5% of cases.
- **Irrational performance**: Most wrong answers are not due to cognitive biases, but to illogical reasoning or calculation errors.
- **Inconsistency**: LLM responses to the same task are significantly inconsistent, especially in mathematical calculation tasks.

### Conclusion

By systematically evaluating the rational reasoning abilities of LLMs, the paper reveals the distinctive irrational characteristics of these models on cognitive tasks and proposes a method for evaluating and comparing LLM performance. This research provides a basis for developing future benchmarks that test model rationality.
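The four-way classification and the inconsistency finding described above can be illustrated with a small tally. This is a minimal sketch, not the authors' evaluation code: the `Category` enum, the `category_shares` and `consistency` helpers, and the example labels are all hypothetical, and the consistency measure used here (share of responses in the most common category) is one possible choice that the summary does not specify.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    """The four response categories listed in the summary (names are illustrative)."""
    CORRECT_REASONED = "correct and reasoned"
    CORRECT_ILLOGICAL = "correct but illogical reasoning"
    INCORRECT_HUMAN_LIKE = "incorrect and human-like"
    INCORRECT_NON_HUMAN_LIKE = "incorrect and non-human-like"


def category_shares(labels):
    """Return the fraction of responses that fall in each category."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {cat: counts[cat] / len(labels) for cat in Category}


def consistency(labels):
    """Share of responses in the most common category (1.0 means fully consistent).

    Assumed measure for illustration only.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels for one model answering one task 10 times (not data from the paper).
example = (
    [Category.CORRECT_REASONED] * 6
    + [Category.INCORRECT_HUMAN_LIKE] * 1
    + [Category.INCORRECT_NON_HUMAN_LIKE] * 3
)

shares = category_shares(example)
for cat, share in shares.items():
    print(f"{cat.value}: {share:.0%}")
print(f"consistency: {consistency(example):.0%}")
```

Repeating the same prompt several times and tallying labels per task in this way separates two questions the summary distinguishes: how often a model is wrong, and whether it is wrong in the same way each time.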

(Ir)rationality and Cognitive Biases in Large Language Models

Large Language Models Assume People are More Rational than We Really are

Comparing Rationality Between Large Language Models and Humans: Insights and Open Questions

Studying and improving reasoning in humans and machines

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Rationality Report Cards: Assessing the Economic Rationality of Large Language Models

Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments

Language models show human-like content effects on reasoning tasks

Evaluating Large Language Models with NeuBAROCO: Syllogistic Reasoning Ability and Human-like Biases

CBEval: A framework for evaluating and interpreting cognitive biases in LLMs

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Balancing Rigor and Utility: Mitigating Cognitive Biases in Large Language Models for Multiple-Choice Questions

Cognitive Bias in Decision-Making with LLMs

Exploring Reasoning Biases in Large Language Models Through Syllogism: Insights from the NeuBAROCO Dataset

Can Large Language Models Reason? A Characterization via 3-SAT

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond