Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT

Thilo Hagendorff, Sarah Fabi, Michal Kosinski
DOI: https://doi.org/10.1038/s43588-023-00527-x
2023-10-06
Nature Computational Science
Abstract: We design a battery of semantic illusions and cognitive reflection tests, aimed to elicit intuitive yet erroneous responses. We administer these tasks, traditionally used to study reasoning and decision-making in humans, to OpenAI's generative pre-trained transformer model family. The results show that as the models expand in size and linguistic proficiency they increasingly display human-like intuitive system 1 thinking and associated cognitive errors. This pattern shifts notably with the introduction of ChatGPT models, which tend to respond correctly, avoiding the traps embedded in the tasks. Both ChatGPT-3.5 and 4 utilize the input–output context window to engage in chain-of-thought reasoning, reminiscent of how people use notepads to support their system 2 thinking. Yet, they remain accurate even when prevented from engaging in chain-of-thought reasoning, indicating that their system-1-like next-word generation processes are more accurate than those of older models. Our findings highlight the value of applying psychological methodologies to study large language models, as this can uncover previously undetected emergent characteristics.
What problem does this paper attempt to address?
This paper investigates how large language models (LLMs) behave in reasoning and decision-making tasks, and in particular whether they show human-like System 1 thinking (fast, automatic, intuition-based) as opposed to System 2 thinking (slow, deliberate, logic-based). To that end, the researchers designed a battery of semantic illusions and cognitive reflection tests, task types traditionally used to study human reasoning and decision-making, and administered them to OpenAI's generative pre-trained transformer (GPT) model family, covering models of different sizes and complexity as well as the ChatGPT models. The main objectives of the study include:

1. **Explore the reasoning abilities of LLMs**: By analyzing how these models perform on cognitive reflection test (CRT) tasks and semantic illusion tasks, determine whether they make intuitive judgments and engage in deliberate thinking the way humans do.
2. **Compare the performance of different models**: Compare how earlier models (such as GPT-1 to GPT-3-davinci-003) and the latest models (such as ChatGPT-3.5 and ChatGPT-4) handle these tasks, and examine how increases in model size and complexity affect reasoning abilities.
3. **Evaluate the distinctiveness of the ChatGPT models**: Examine the marked improvement of the ChatGPT models in avoiding intuitive errors and solving the problems correctly, and explore possible reasons for it, such as the introduction of a reinforcement-learning mechanism or more training data.

Through these studies, the authors hope to reveal the cognitive mechanisms LLMs rely on when handling complex tasks and to provide a theoretical basis for further optimizing these models.
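
To make the distinction between an intuitive (System 1) error and a correct answer concrete, the following is a minimal sketch of how free-text responses to a single CRT-style item might be scored. It is not the authors' evaluation code. The item is the classic bat-and-ball problem from the CRT literature rather than one of the paper's own tasks, and the function name `classify_response`, the three labels, and the rule "the last number in the response is the final answer" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Classic bat-and-ball item (illustrative; not taken from the paper's battery):
# "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball.
#  How much does the ball cost?"
CORRECT_ANSWER = 0.05    # dollars
INTUITIVE_ANSWER = 0.10  # dollars; the tempting but wrong answer


def classify_response(response: str, correct: float, intuitive: float) -> str:
    """Label a response as 'correct', 'intuitive', or 'atypical'.

    Assumption: the last number mentioned is the model's final answer,
    expressed in the same unit (dollars) as the reference answers.
    """
    numbers = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", response)]
    if not numbers:
        return "atypical"
    final = numbers[-1]
    if abs(final - correct) < 1e-9:
        return "correct"
    if abs(final - intuitive) < 1e-9:
        return "intuitive"
    return "atypical"


def tally(responses: list[str]) -> Counter:
    """Count how many responses fall into each label."""
    return Counter(
        classify_response(r, CORRECT_ANSWER, INTUITIVE_ANSWER) for r in responses
    )


if __name__ == "__main__":
    samples = [
        "The ball costs $0.10.",
        "Let x be the ball: x + (x + 1.00) = 1.10, so x = 0.05",
        "I am not sure.",
    ]
    print(tally(samples))
```

A scorer this simple misreads answers given in other units (for example "5 cents"), so a real evaluation would need unit normalization or manual review of the responses.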