Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Jonathan Ivey, Shivani Kumar, Jiayu Liu, Hua Shen, Sushrita Rakshit, Rohan Raju, Haotian Zhang, Aparna Ananthasubramaniam, Junghwan Kim, Bowen Yi, Dustin Wright, Abraham Israeli, Anders Giovanni Møller, Lechen Zhang, David Jurgens
2024-09-17
Abstract: Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations *actually* reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly. Our results suggest that LLMs generally perform better when the human themself writes in a way that is more similar to the LLM's own style.
Computation and Language, Computers and Society, Human-Computer Interaction
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper evaluates how well large language models (LLMs) simulate human conversations. Specifically, the researchers focus on the following core questions:

1. **Can LLMs accurately simulate human conversations?** The researchers generated a large-scale dataset of 100,000 paired LLM-LLM and human-LLM conversations and quantified how closely the LLM simulations align with the real human conversations. They found relatively low alignment between LLM simulations and human interactions, with systematic differences across multiple textual properties such as style and content.
2. **How do the choice of model and prompt affect the result?** The researchers explored how the choice of LLM and of prompt instructions affects the ability of LLMs to simulate human behavior. They evaluated 9 different LLMs and 50 different prompt instructions.
3. **How do LLMs perform in a multilingual setting?** The researchers further analyzed English, Chinese, and Russian conversations and found that the models perform similarly across these three languages, but that performance remains generally low.
4. **Under which circumstances are LLMs more likely to simulate human responses effectively?** The researchers used regression analysis to identify the factors that help LLMs simulate human responses. The results show that when the human's initial conversational style is closer to the LLM's own style, the LLM better matches the human's subsequent conversational behavior.

### Main contributions

1. **A general evaluation framework** The researchers introduced a general evaluation framework for meaningfully analyzing human-LLM simulations and provided a new dataset of more than 1,200 annotator responses for comparison with human-level performance.
2. **Large-scale analysis** The researchers analyzed 9 LLMs simulating 2,000 English human-LLM conversations under 50 prompt instructions. Even the best combination of model and prompt simulates human behavior only weakly.
3. **Multilingual analysis** The researchers analyzed 10,000 Chinese and Russian human-LLM conversations and found that performance is roughly similar to English but still low.
4. **Regression analysis** Through regression analysis, the researchers identified the factors that bring LLM responses closer to human ones. The results show that when a human starts a conversation in a writing style similar to that of the LLM, the LLM better matches the human's subsequent conversational behavior.

### Conclusion

Through large-scale experiments and analysis from several angles (model, prompt, language, and conversation features), this paper shows that current LLMs are limited in their ability to simulate human conversational turns and identifies the conditions under which they do better. These results are a useful reference for anyone who plans to use LLM-simulated users when building or evaluating conversational systems.
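As a rough illustration of what "quantifying alignment" between a simulated turn and the human turn it replaces might look like, the following Python sketch compares paired responses on two simple textual properties. The metrics (`length_similarity`, `jaccard_overlap`), the function names, and the example pairs are all assumptions made for illustration; they are not the paper's actual measures or data.

```python
import re
from statistics import mean


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def length_similarity(human, simulated):
    """Ratio of the shorter to the longer token count (1.0 means equal length)."""
    h, s = len(tokenize(human)), len(tokenize(simulated))
    if max(h, s) == 0:
        return 1.0  # two empty responses are trivially the same length
    return min(h, s) / max(h, s)


def jaccard_overlap(human, simulated):
    """Jaccard similarity of the two responses' vocabularies."""
    h, s = set(tokenize(human)), set(tokenize(simulated))
    if not h and not s:
        return 1.0
    return len(h & s) / len(h | s)


def alignment_scores(pairs):
    """Average each (hypothetical) metric over (human, simulated) pairs."""
    return {
        "length": mean(length_similarity(h, s) for h, s in pairs),
        "lexical": mean(jaccard_overlap(h, s) for h, s in pairs),
    }


# Made-up example pairs: (real human turn, LLM-simulated human turn).
pairs = [
    ("thanks, can you make it shorter?",
     "Thank you for the detailed response. Could you please provide a more concise version?"),
    ("ok now in python", "ok now in python"),
]
scores = alignment_scores(pairs)
print(scores)
```

Each score lies between 0 and 1, with 1 meaning the simulated turn matches the human turn on that property. A real evaluation would use a broader set of style and content measures than these two toy metrics.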