Abstract: As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see https://huggingface.co/spaces/XufengDuan/HumanLikeness). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.

What problem does this paper attempt to address?

This paper addresses how to evaluate the "humanlikeness" of large language models (LLMs) in actual language use. As synthetic data, especially generated dialogue, is used more widely to train language models, the models may drift away from real human language patterns and lose the richness and creativity inherent in human communication. A systematic evaluation framework is therefore needed to measure how closely these models resemble humans in language use. Specifically, the research aims to:

1. **Establish a comprehensive benchmark (HLB)**: Design 10 psycholinguistic experiments covering five core linguistic levels (sound, word, syntax, semantics, and discourse) and use them to evaluate 20 large language models.
2. **Collect and compare human and model data**: Collect responses from more than 2,000 human participants, compare them with the outputs of LLMs, and quantify the similarity between models and humans at each linguistic level.
3. **Introduce new evaluation methods**: Apply psycholinguistic methods to the evaluation of LLMs, providing a systematic framework for understanding how these models process and generate language.
4. **Reveal subtle differences in model performance**: Show that improvements in other performance metrics do not necessarily lead to greater humanlikeness and, in some cases, are accompanied by a decline.

In short, the core aim of the paper is an evaluation standard that checks not only whether LLMs perform well at task completion but also whether their language use is close to that of humans, especially for complex and diverse language phenomena.
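The idea of scoring humanlikeness by distributional similarity can be illustrated with a minimal sketch. The text above does not say which similarity measure is used, so this example assumes Jensen-Shannon divergence (base 2) and defines the score as one minus that divergence. The function names, the category labels, and the sample responses are all hypothetical and are not taken from the benchmark.

```python
from collections import Counter
from math import log2


def response_distribution(responses, categories):
    """Turn a list of coded responses into a probability vector over categories."""
    counts = Counter(responses)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no responses fall into the given categories")
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0
        # because y is the mixture of the two distributions.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def humanlikeness(human_responses, model_responses, categories):
    """Assumed score: 1 for identical distributions, 0 for disjoint ones."""
    p = response_distribution(human_responses, categories)
    q = response_distribution(model_responses, categories)
    return 1.0 - js_divergence(p, q)


# Hypothetical coded responses for a single task with two response categories.
categories = ["option_a", "option_b"]
human = ["option_a"] * 70 + ["option_b"] * 30
model = ["option_a"] * 95 + ["option_b"] * 5
score = humanlikeness(human, model, categories)
print(f"humanlikeness score: {score:.3f}")
```

Under this assumed definition, a model that always gives the most common human answer can still score below 1, because the comparison is between whole response distributions rather than single best answers.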

HLB: Benchmarking LLMs' Humanlikeness in Language Use

Human Simulacra: Benchmarking the Personification of Large Language Models

Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

DHP Benchmark: Are LLMs Good NLG Evaluators?

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Do LLMs exhibit human-like response biases? A case study in survey design

Dissecting Human and LLM Preferences

A User-Centric Benchmark for Evaluating Large Language Models

Evaluating Human-Language Model Interaction

How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation

Benchmarking Distributional Alignment of Large Language Models

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas

Are You Human? An Adversarial Benchmark to Expose LLMs

BeHonest: Benchmarking Honesty in Large Language Models

Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models

A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models

A Survey on Human-Centric LLMs