Abstract: We design a battery of semantic illusions and cognitive reflection tests, aimed to elicit intuitive yet erroneous responses. We administer these tasks, traditionally used to study reasoning and decision-making in humans, to OpenAI's generative pre-trained transformer model family. The results show that as the models expand in size and linguistic proficiency they increasingly display human-like intuitive system 1 thinking and associated cognitive errors. This pattern shifts notably with the introduction of ChatGPT models, which tend to respond correctly, avoiding the traps embedded in the tasks. Both ChatGPT-3.5 and 4 utilize the input–output context window to engage in chain-of-thought reasoning, reminiscent of how people use notepads to support their system 2 thinking. Yet, they remain accurate even when prevented from engaging in chain-of-thought reasoning, indicating that their system-1-like next-word generation processes are more accurate than those of older models. Our findings highlight the value of applying psychological methodologies to study large language models, as this can uncover previously undetected emergent characteristics.

What problem does this paper attempt to address?

This paper asks how large language models (LLMs) behave in reasoning and decision-making tasks, and in particular whether they show the cognitive patterns that psychologists describe in humans as System 1 (fast, automatic, intuition-based processing) and System 2 (slow, deliberate, logic-based processing). To answer this, the researchers designed a series of semantic illusions and cognitive reflection test (CRT) tasks, which are normally used to study human reasoning and decision-making, and administered them to OpenAI's generative pre-trained transformer (GPT) models of different sizes and complexity, as well as to the ChatGPT models. The main objectives of the study are:

1. **Explore the reasoning abilities of LLMs**: Analyze how the models perform on CRT tasks and semantic illusion tasks to determine whether they produce intuitive judgments and deliberate reasoning in the way humans do.
2. **Compare the performance of different models**: Contrast earlier models (from GPT-1 to GPT-3-davinci-003) with the most recent ones (ChatGPT-3.5 and ChatGPT-4) on the same tasks, and examine how increases in model size and complexity affect reasoning.
3. **Evaluate the uniqueness of the ChatGPT models**: Examine the marked improvement of the ChatGPT models in avoiding intuitive errors and solving the problems correctly, and consider possible reasons for it, such as the introduction of a reinforcement learning mechanism or additional training data.

Through these analyses, the authors hope to shed light on the cognitive mechanisms LLMs rely on when handling such tasks and to provide a theoretical basis for improving these models further.
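Comparing models on CRT-style tasks requires sorting each response into one of three kinds: the correct answer, the tempting intuitive answer, or something else. The following is a minimal sketch of that coding step, not the authors' procedure. It uses the classic bat-and-ball problem as an illustration (the paper's own battery is not reproduced here), and the names `ITEM`, `extract_number`, `classify`, and `tally` are hypothetical. It also assumes the answer is the last number in the response, which is a simplification that can misfire on responses mentioning several quantities.

```python
import re
from collections import Counter

# Illustrative item only: the classic bat-and-ball problem, not necessarily
# one of the items used in the paper.
ITEM = {
    "prompt": (
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
        "than the ball. How much does the ball cost, in cents?"
    ),
    "correct": 5.0,     # deliberate (System 2) answer, in cents
    "intuitive": 10.0,  # tempting (System 1) answer, in cents
}


def extract_number(text):
    """Return the last number found in the text as a float, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def classify(response, item):
    """Label a response as 'correct', 'intuitive', or 'atypical'."""
    value = extract_number(response)
    if value is None:
        return "atypical"
    if value == item["correct"]:
        return "correct"
    if value == item["intuitive"]:
        return "intuitive"
    return "atypical"


def tally(responses, item):
    """Count how many responses fall into each category."""
    return Counter(classify(r, item) for r in responses)


if __name__ == "__main__":
    # Hand-written example responses, not model outputs.
    sample = ["The ball costs 5 cents.", "10 cents", "I am not sure."]
    print(tally(sample, ITEM))
```

Tallying categories per model, rather than only accuracy, is what makes it possible to tell a model that falls for the intuitive trap apart from one that simply fails to answer.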

Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT

Thinking Fast and Slow in Large Language Models

Do large language models show decision heuristics similar to humans? A case study using GPT-3.5.

Using cognitive psychology to understand GPT-3

Assessing the nature of large language models: A caution against anthropocentrism

Evaluating Large Language Models in Theory of Mind Tasks

Can generative AI infer thinking style from language? Evaluating the utility of AI as a psychological text analysis tool

ChatGPT or Human? Detect and Explain. Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated Text

The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks

Deception Abilities Emerged in Large Language Models

Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs

Theory of Mind May Have Spontaneously Emerged in Large Language Models

Assessing Large Language Models' ability to predict how humans balance self-interest and the interest of others

Cognitive Effects in Large Language Models

Testing theory of mind in large language models and humans

Human heuristics for AI-generated language are flawed

ChatGPT broke the Turing test — the race is on for new ways to assess AI

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?