Can AI Replace Human Subjects? A Large-Scale Replication of Psychological Experiments with LLMs

Ziyan Cui, Ning Li, Huaikang Zhou
2024-09-04
Abstract: Artificial Intelligence (AI) is increasingly being integrated into scientific research, particularly in the social sciences, where understanding human behavior is critical. Large Language Models (LLMs) like GPT-4 have shown promise in replicating human-like responses in various psychological experiments. However, the extent to which LLMs can effectively replace human subjects across diverse experimental contexts remains unclear. Here, we conduct a large-scale study replicating 154 psychological experiments from top social science journals with 618 main effects and 138 interaction effects using GPT-4 as a simulated participant. We find that GPT-4 successfully replicates 76.0 percent of main effects and 47.0 percent of interaction effects observed in the original studies, closely mirroring human responses in both direction and significance. However, only 19.44 percent of GPT-4's replicated confidence intervals contain the original effect sizes, with the majority of replicated effect sizes exceeding the 95 percent confidence interval of the original studies. Additionally, there is a 71.6 percent rate of unexpected significant results where the original studies reported null findings, suggesting potential overestimation or false positives. Our results demonstrate the potential of LLMs as powerful tools in psychological research but also emphasize the need for caution in interpreting AI-driven findings. While LLMs can complement human studies, they cannot yet fully replace the nuanced insights provided by human subjects.
Computation and Language, Artificial Intelligence, General Economics
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) such as GPT-4 can stand in for human subjects in psychological experiments. To test this, the researchers used GPT-4 as a simulated participant in a large-scale replication of 154 psychological experiments from top social science journals. The study has three main objectives:

1. **Evaluating the replication ability of LLMs**: Determining whether LLMs can reliably simulate human behavior in various psychological experiments and analyzing their performance in different experimental contexts.
2. **Comparing LLMs with human data**: Using statistical analysis to compare the data generated by LLMs with human data from the original studies, assessing the similarity of LLM responses in terms of direction, significance, and effect size.
3. **Exploring the limitations of LLMs**: Identifying potential issues with LLMs in specific contexts, particularly when dealing with sensitive topics (e.g., race, moral judgments).

The results show that GPT-4 replicated 76% of main effects but only 47% of interaction effects. LLM responses were highly consistent with human responses in effect direction, but their effect sizes were generally larger: only 19.44% of GPT-4's replicated confidence intervals contained the original effect sizes. GPT-4 also produced significant results in 71.6% of cases where the original studies reported null findings. Researchers should therefore exercise caution when interpreting AI-driven research findings. Overall, LLMs can serve as a research tool to supplement human studies, but they cannot yet fully replace the complex insights provided by human subjects.
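
The three quantities reported here (replication rate, confidence-interval coverage, and the rate of unexpected significant results) can be made concrete with a short sketch. This is not the authors' analysis code. The `Effect` record, its field names, the toy numbers, and the choice of denominators are all assumptions made for illustration: replication is taken to mean "significant in both studies with the same sign", coverage is computed over replicated effects only, and the unexpected-significance rate is computed over effects that were null in the original study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Effect:
    """One effect, original study vs. LLM replication (hypothetical record)."""
    original_size: float    # effect size reported by the original human study
    original_sig: bool      # was the original effect statistically significant?
    replica_size: float     # effect size estimated from the LLM responses
    replica_ci_low: float   # lower bound of the replica's 95% CI
    replica_ci_high: float  # upper bound of the replica's 95% CI
    replica_sig: bool       # was the replica effect statistically significant?


def is_replicated(e: Effect) -> bool:
    # Assumed criterion: significant in both studies and same direction.
    return e.original_sig and e.replica_sig and e.original_size * e.replica_size > 0


def ci_covers_original(e: Effect) -> bool:
    # Does the replica's 95% CI contain the original effect size?
    return e.replica_ci_low <= e.original_size <= e.replica_ci_high


def is_unexpected_significant(e: Effect) -> bool:
    # Original reported a null finding, but the replica is significant.
    return (not e.original_sig) and e.replica_sig


def rate(flags: List[bool]) -> Optional[float]:
    # Share of True values; None when there is nothing to count.
    return sum(flags) / len(flags) if flags else None


def summarize(effects: List[Effect]) -> dict:
    significant = [e for e in effects if e.original_sig]
    null = [e for e in effects if not e.original_sig]
    replicated = [e for e in significant if is_replicated(e)]
    return {
        "replication_rate": rate([is_replicated(e) for e in significant]),
        "ci_coverage": rate([ci_covers_original(e) for e in replicated]),
        "unexpected_significant_rate": rate(
            [is_unexpected_significant(e) for e in null]
        ),
    }


# Toy data, invented for this example only.
toy_effects = [
    Effect(0.20, True, 0.50, 0.40, 0.60, True),     # replicated, inflated effect
    Effect(0.30, True, 0.35, 0.25, 0.45, True),     # replicated, CI covers original
    Effect(0.25, True, -0.10, -0.30, 0.10, False),  # not replicated
    Effect(0.02, False, 0.30, 0.20, 0.40, True),    # unexpected significant result
    Effect(0.00, False, 0.05, -0.10, 0.20, False),  # null in both studies
]

summary = summarize(toy_effects)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The first toy effect shows the pattern the summary describes: direction and significance match the original, yet the replica's interval sits entirely above the original effect size, so it counts toward the replication rate but not toward coverage.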