Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations?

Yue Huang, Zhengqing Yuan, Yujun Zhou, Kehan Guo, Xiangqi Wang, Haomin Zhuang, Weixiang Sun, Lichao Sun, Jindong Wang, Yanfang Ye, Xiangliang Zhang
2024-10-31
Abstract: Large Language Models (LLMs) are increasingly employed for simulations, enabling applications in role-playing agents and Computational Social Science (CSS). However, the reliability of these simulations is under-explored, which raises concerns about the trustworthiness of LLMs in these applications. In this paper, we aim to answer "How reliable is LLM-based simulation?" To address this, we introduce TrustSim, an evaluation dataset covering 10 CSS-related topics, to systematically investigate the reliability of the LLM simulation. We conducted experiments on 14 LLMs and found that inconsistencies persist in the LLM-based simulated roles. In addition, the consistency level of LLMs does not strongly correlate with their general performance. To enhance the reliability of LLMs in simulation, we proposed Adaptive Learning Rate Based ORPO (AdaORPO), a reinforcement learning-based algorithm to improve the reliability in simulation across 7 LLMs. Our research provides a foundation for future studies to explore more robust and trustworthy LLM-based simulations.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper examines the reliability of large language models (LLMs) in social simulation. Specifically, the authors ask: "How reliable are LLM-based simulations?" The core issue is whether the responses an LLM generates stay consistent with a predefined character's traits, cognitive abilities, behaviors, and other attributes, so that simulation results are credible in applications such as role-playing agents and computational social science.

### Background and Motivation

In recent years, LLMs have drawn wide attention for their performance in natural language processing (NLP). They have shown strong capabilities in healthcare, data generation, agents, and scientific research, and also in social simulation, where users supply a predefined character profile and rely on the model to simulate that person. The reliability of these simulations, however, has not been fully explored, which raises concerns about how far LLMs can be trusted in such applications.

### Research Objectives

To assess the reliability of LLMs in social simulation, the authors introduce the TrustSim dataset, which covers 10 topics related to computational social science. Using this dataset, they systematically evaluate 14 popular LLMs and find that, although most perform well in simulation, there is still room for improvement. In addition, there is no strong correlation between an LLM's simulation capability and its general performance, and some LLMs give inconsistent answers to the same question when it is posed in different formats.

### Main Contributions

1. **Introduction of the TrustSim dataset**: Covers 10 topics related to computational social science and is used to systematically evaluate the reliability of LLMs in simulations.
2. **Extensive experiments**: Experiments on 14 popular LLMs using the TrustSim dataset, yielding several key insights.
3. **The AdaORPO algorithm**: A reinforcement learning-based algorithm aimed at improving the reliability of LLMs in simulations, validated on 7 LLMs.

### Conclusion

Ensuring the reliability of LLM-based simulations matters for future research. This study lays a foundation for further work on more robust and credible LLM simulations.
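
The format inconsistency noted under Research Objectives (the same question answered differently when posed in different formats) can be made concrete with a small check. The sketch below is illustrative only and is not the paper's evaluation metric: the function name `consistency_rate`, the record layout, and the format labels are all assumptions introduced here. It counts a question as consistent when its normalized answers agree across every format in which it was asked.

```python
from collections import defaultdict


def consistency_rate(records):
    """Fraction of questions whose answers agree across all formats.

    `records` is an iterable of (question_id, format_name, answer) tuples.
    This layout is a hypothetical one chosen for illustration.
    """
    answers_by_question = defaultdict(set)
    for question_id, _format_name, answer in records:
        # Normalize lightly so "Agree" and "agree " count as the same answer.
        answers_by_question[question_id].add(answer.strip().lower())

    if not answers_by_question:
        return 0.0

    # A question is consistent if only one distinct answer was observed.
    consistent = sum(1 for a in answers_by_question.values() if len(a) == 1)
    return consistent / len(answers_by_question)


# Made-up example data: q1 agrees across formats, q2 does not.
records = [
    ("q1", "multiple_choice", "Agree"),
    ("q1", "open_ended", "agree"),
    ("q2", "multiple_choice", "Agree"),
    ("q2", "open_ended", "Disagree"),
]
print(consistency_rate(records))  # 0.5
```

Exact string matching after normalization is a deliberate simplification; comparing free-form answers in practice would likely need a judge model or a semantic similarity measure.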