Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans

Vittoria Dentella, Fritz Guenther, Evelina Leivada
2024-10-07
Abstract: Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans, however it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.
Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper explores the differences between the performance of large language models (LLMs) and humans on natural language tasks, and investigates whether these differences can be mitigated by increasing model scale. Specifically, the authors tested three LLMs of different sizes (Bard, ChatGPT-3.5, and ChatGPT-4) on a grammaticality judgment task covering anaphora, center embedding, comparatives, and negative polarity items. By comparing the performance of these models with that of humans, the authors aim to answer the following questions:

1. **Accuracy**: Can increasing the number of model parameters improve accuracy in grammaticality judgment tasks?
2. **Stability**: Are the model's responses more stable when the same task is presented repeatedly?
3. **Repetition Effect**: Do the model's accuracy and stability improve when the same task is presented multiple times?

### Main Findings

1. **Accuracy**:
   - All LLMs performed better when judging grammatically correct sentences.
   - Bard and ChatGPT-3.5 performed comparably.
   - ChatGPT-4 performed the best among all models, especially when judging grammatically correct sentences, with an accuracy rate of 93.5%.
   - For grammatically incorrect sentences, ChatGPT-4 also outperformed the other models, but it still made some errors.
2. **Stability**:
   - All LLMs were more stable when judging grammatically correct sentences.
   - Bard and ChatGPT-3.5 performed comparably.
   - ChatGPT-4 was the most stable of the models, but in some cases its response variability was greater than that of humans.
3. **Repetition Effect**:
   - Bard's performance on grammatically correct sentences improved when the same task was presented multiple times.
   - ChatGPT-3.5's performance declined when the same task was presented multiple times.
   - ChatGPT-4's performance on grammatically correct sentences improved when the same task was presented multiple times, but its performance on grammatically incorrect sentences declined slightly.

### Conclusion

Although increasing model scale can significantly improve performance on grammaticality judgment tasks, these models still do not match human performance, especially on grammatically incorrect sentences. Additionally, the models' responses are less stable than those of humans when the same task is presented multiple times. Therefore, increasing model scale alone may not be sufficient to resolve these differences. The authors attribute this to key elements of human language learning that are missing from the models' learning process, such as sensory information, pragmatic functions, and communicative intentions.
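
To make the three measures discussed above (accuracy, stability, and the repetition effect) more concrete, the following is a minimal Python sketch of how repeated grammaticality judgments could be scored. It is not the authors' analysis pipeline: the record layout, the function name `score_judgments`, and the simple proportion-based definitions (in particular, counting a sentence as "oscillating" if its answers differ across repetitions) are assumptions made for illustration, and the toy data is invented.

```python
from collections import defaultdict


def score_judgments(records):
    """Score repeated grammaticality judgments.

    `records` is an iterable of tuples (hypothetical layout, not from the paper):
        (sentence_id, is_grammatical, repetition_index, judged_grammatical)
    """
    by_sentence = defaultdict(list)
    for sentence_id, is_grammatical, rep, judged in records:
        by_sentence[sentence_id].append((rep, is_grammatical, judged))

    correct = total = oscillating = 0
    per_rep = defaultdict(lambda: [0, 0])  # repetition index -> [correct, total]

    for rows in by_sentence.values():
        # Assumed stability criterion: any change of answer across repetitions.
        if len({judged for _, _, judged in rows}) > 1:
            oscillating += 1
        for rep, gold, judged in rows:
            hit = int(judged == gold)
            correct += hit
            total += 1
            per_rep[rep][0] += hit
            per_rep[rep][1] += 1

    return {
        "accuracy": correct / total,
        "oscillation_rate": oscillating / len(by_sentence),
        "accuracy_by_repetition": {
            rep: c / t for rep, (c, t) in sorted(per_rep.items())
        },
    }


# Invented toy data: four sentences, each judged twice.
toy_records = [
    ("s1", True, 0, True), ("s1", True, 1, True),
    ("s2", False, 0, True), ("s2", False, 1, False),
    ("s3", False, 0, False), ("s3", False, 1, False),
    ("s4", True, 0, False), ("s4", True, 1, False),
]

result = score_judgments(toy_records)
print(result)
```

Comparing `accuracy_by_repetition` across repetition indices gives a rough view of the repetition effect, while `oscillation_rate` is a crude descriptive stand-in for the likelihood of an oscillating answer reported in the abstract, which may have been estimated differently.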