Abstract: Large Language Models (LLMs) are increasingly employed for simulations, enabling applications in role-playing agents and Computational Social Science (CSS). However, the reliability of these simulations is under-explored, which raises concerns about the trustworthiness of LLMs in these applications. In this paper, we aim to answer the question "How reliable is LLM-based simulation?" To address it, we introduce TrustSim, an evaluation dataset covering 10 CSS-related topics, to systematically investigate the reliability of LLM simulation. We conduct experiments on 14 LLMs and find that inconsistencies persist in LLM-based simulated roles. In addition, the consistency level of LLMs does not strongly correlate with their general performance. To enhance the reliability of LLMs in simulation, we propose Adaptive Learning Rate Based ORPO (AdaORPO), a reinforcement learning-based algorithm that improves simulation reliability across 7 LLMs. Our research provides a foundation for future studies to explore more robust and trustworthy LLM-based simulations.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to explore the reliability of large language models (LLMs) in social simulation. Specifically, the authors seek to answer the question: "How reliable are LLM-based simulations?" The core of this question lies in whether the responses generated by LLMs can remain consistent with predefined character traits, cognitive abilities, behaviors, and other attributes, thereby providing credible simulation results in various application scenarios such as role-playing agents and computational social science.

### Background and Motivation

In recent years, large language models (LLMs) have garnered widespread attention for their outstanding performance in the field of natural language processing (NLP). These models have demonstrated significant capabilities not only in healthcare, data generation, agents, and scientific research but also in social simulation, where users can leverage these models' human simulation abilities by providing predefined character profiles. However, despite the excellent performance of LLMs in these applications, the reliability of their simulation results has not been fully explored, raising concerns about the credibility of LLMs in these applications.

### Research Objectives

To assess the reliability of LLMs in social simulation, the authors introduced the TrustSim dataset, covering 10 topics related to computational social science. Through this dataset, the authors systematically investigated the performance of 14 popular LLMs in simulations and found that although most LLMs performed well in simulations, there is still room for improvement. Additionally, there is no strong correlation between the simulation capabilities of LLMs and their general performance, with some LLMs providing inconsistent answers to the same question in different formats.

### Main Contributions

1. **Introduction of the TrustSim dataset**: Covering 10 topics related to computational social science, used for systematically evaluating the reliability of LLMs in simulations.
2. **Extensive experiments**: Conducted extensive experiments on 14 popular LLMs based on the TrustSim dataset, identifying several key insights.
3. **Proposed the AdaORPO algorithm**: A reinforcement learning-based algorithm aimed at improving the reliability of LLMs in simulations, validated on 7 LLMs.

### Conclusion

Ensuring the reliability of LLM-based simulations is of great significance for future research. This study lays the foundation for future exploration of more robust and credible LLM simulations.
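The summary notes that some LLMs give inconsistent answers to the same question when it is posed in different formats. One simple way to quantify that kind of inconsistency is sketched below. This is an illustrative assumption, not the paper's actual metric or the TrustSim data format: the function names, the format labels (`multiple_choice`, `open_ended`), and the exact-match comparison after normalization are all hypothetical choices.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(answer.strip().lower().split())


def is_consistent(answers: dict[str, str]) -> bool:
    """True if every format yields the same normalized answer.

    `answers` maps a (hypothetical) format name to the model's answer.
    An empty mapping is treated as not consistent.
    """
    return len({normalize(a) for a in answers.values()}) == 1


def consistency_rate(responses: dict[str, dict[str, str]]) -> float:
    """Fraction of questions answered identically across all formats.

    `responses` maps a question id to its per-format answers.
    """
    if not responses:
        raise ValueError("no responses to score")
    consistent = sum(is_consistent(a) for a in responses.values())
    return consistent / len(responses)


# Hypothetical answers from one simulated persona.
example = {
    "q1": {"multiple_choice": "Agree", "open_ended": "agree "},
    "q2": {"multiple_choice": "Agree", "open_ended": "Disagree"},
}
print(consistency_rate(example))  # 0.5
```

Exact matching only suits short categorical answers. Free-text responses would need a semantic comparison, such as a judge model or embedding similarity, which this sketch does not attempt.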

Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations?
