HLB: Benchmarking LLMs' Humanlikeness in Language Use

Xufeng Duan, Bei Xiao, Xuemei Tang, Zhenguang G. Cai
2024-09-24
Abstract: As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see https://huggingface.co/spaces/XufengDuan/HumanLikeness). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.
Computation and Language
What problem does this paper attempt to address?
This paper sets out to evaluate the "humanlikeness" of large language models (LLMs) in actual language use. As synthetic data, particularly generated dialogue, becomes more common in training language models, these models may drift away from authentic human language patterns and lose the richness and creativity inherent in human communication. A systematic evaluation framework is therefore needed to measure how closely models resemble humans in language use. Specifically, the research aims to:

1. **Establish a comprehensive benchmark (HLB)**: Design 10 psycholinguistic experiments covering five core linguistic levels (sound, word, syntax, semantics, and discourse) and use them to evaluate 20 LLMs.
2. **Collect and compare human and model data**: Collect responses from more than 2,000 human participants and compare their response distributions with LLM outputs, quantifying humanlikeness as distributional similarity at each linguistic level.
3. **Introduce new evaluation methods**: Apply psycholinguistic methods to model evaluation, providing a systematic framework for understanding how these models process and generate language.
4. **Reveal fine-grained differences in model performance**: Show that improvements in other performance metrics do not necessarily lead to greater humanlikeness, and in some cases accompany a decline in it.

In short, the paper develops a new evaluation standard that asks not only whether LLMs complete tasks well but also whether their language use resembles that of humans across complex and diverse linguistic phenomena.
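
To make the idea of scoring humanlikeness by distributional similarity concrete, here is a minimal sketch. It assumes that responses have already been coded into discrete categories and that similarity is measured as one minus the Jensen-Shannon divergence (base 2). The text above does not name the metric, so this choice is an assumption. The function names and the category labels `"A"` and `"B"` are also hypothetical.

```python
import math
from collections import Counter


def response_distribution(responses, categories):
    """Turn a list of coded responses into a probability vector over `categories`."""
    counts = Counter(responses)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no responses fall into the given categories")
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def humanlikeness(human_responses, model_responses, categories):
    """Assumed score: 1 - JSD between human and model response distributions."""
    p = response_distribution(human_responses, categories)
    q = response_distribution(model_responses, categories)
    return 1.0 - js_divergence(p, q)


if __name__ == "__main__":
    # Hypothetical coded responses for one task with two response categories.
    humans = ["A"] * 70 + ["B"] * 30
    model = ["A"] * 95 + ["B"] * 5
    print(round(humanlikeness(humans, model, ["A", "B"]), 3))
```

Under this assumed metric, a score of 1 means the model's response distribution matches the human one exactly, and a score of 0 means the two distributions share no response categories. A per-task score of this kind could then be averaged across experiments, although how the benchmark aggregates scores is not stated here.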