Abstract: Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations *actually* reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly. Our results suggest that LLMs generally perform better when the human themself writes in a way that is more similar to the LLM's own style.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper evaluates how well large language models (LLMs) simulate human conversations. Specifically, the researchers focus on the following core issues:

1. **Can LLMs accurately simulate human conversations?** The researchers generated a large-scale dataset of 100,000 paired LLM-LLM and human-LLM conversations and quantified the degree of alignment between LLM simulations and real human conversations. They found relatively low consistency between LLM simulations and human interactions, with systematic differences across multiple textual attributes such as style and content.
2. **The influence of different models and prompt instructions** The researchers explored how the choice of LLM and of prompt instructions affects the ability of LLMs to simulate human behavior. They used 9 different LLMs and 50 different prompt instructions to evaluate the impact of these factors on simulation quality.
3. **Performance in a multilingual environment** The researchers further analyzed the performance of LLMs in English, Chinese, and Russian conversations and found that performance is similar across the three languages, but still generally low.
4. **Under which circumstances are LLMs more likely to effectively simulate human responses?** The researchers used regression analysis to explore which factors enable LLMs to better simulate human responses. The results show that when the human's initial conversation style is closer to the LLM's own style, the LLM better matches the human's subsequent conversation behavior.

### Main contributions

1. **Proposing a general evaluation framework** The researchers introduced a general evaluation framework for meaningfully analyzing human-LLM simulations and provided a new dataset containing more than 1,200 annotator responses for comparison with human-level performance.
2. **Large-scale analysis** A large-scale analysis was carried out on 9 LLMs simulating 2,000 English human-LLM conversations under 50 prompt instructions. Even the best combination of model and prompt simulates human behavior only weakly.
3. **Multilingual analysis** A multilingual analysis was carried out on 10,000 Chinese and Russian human-LLM conversations, and performance was found to be roughly similar but still low.
4. **Regression analysis** Through regression analysis, the researchers identified which factors bring LLM responses closer to human ones. The results show that when humans start a conversation in a writing style similar to that of the LLM, the LLM better matches the subsequent conversation behavior.

### Conclusion

Through large-scale experiments and analysis from multiple angles, this paper reveals the limitations of LLMs in simulating human conversations and points to potential directions for improvement. The results are a useful reference for the future development of more effective conversation systems.
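To make the idea of quantifying alignment between a human turn and its LLM-simulated counterpart more concrete, here is a minimal Python sketch. It is an illustration only: the summary does not say which metrics the authors use, so the two measures below (token-set Jaccard overlap as a rough content proxy, token-length ratio as a rough style proxy), the function names, and the example turns are all assumptions rather than the paper's method.

```python
import re
from statistics import mean


def tokens(text: str) -> list[str]:
    """Lowercase word tokens. Deliberately crude: text written without
    spaces (e.g. Chinese) would need a proper word segmenter."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; defined as 1.0 when both turns are empty."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def length_ratio(a: str, b: str) -> float:
    """Shorter token count divided by the longer one; 1.0 means equal length."""
    la, lb = len(tokens(a)), len(tokens(b))
    if max(la, lb) == 0:
        return 1.0
    return min(la, lb) / max(la, lb)


def alignment(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average each metric over (human_turn, simulated_turn) pairs."""
    return {
        "jaccard": mean(jaccard(h, s) for h, s in pairs),
        "length_ratio": mean(length_ratio(h, s) for h, s in pairs),
    }


# Hypothetical paired turns: what a human wrote vs. what an LLM
# predicted the human would write at the same point in the dialogue.
pairs = [
    ("thanks, can you make it shorter",
     "Thank you! Could you please shorten the response?"),
    ("ok", "ok"),
]
scores = alignment(pairs)
print(scores)
```

In a fuller analysis, per-pair scores like these could serve as the dependent variable in a regression on properties of the human's opening turn, which is the kind of question the fourth item in each list above addresses.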

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue
