Abstract: Understanding the limits of language is a prerequisite for Large Language Models (LLMs) to act as theories of natural language. LLM performance in some language tasks presents both quantitative and qualitative differences from that of humans; however, it remains to be determined whether such differences are amenable to model size. This work investigates the critical role of model scaling, determining whether increases in size make up for such differences between humans and models. We test three LLMs from different families (Bard, 137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a grammaticality judgment task featuring anaphora, center embedding, comparatives, and negative polarity. N=1,200 judgments are collected and scored for accuracy, stability, and improvements in accuracy upon repeated presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are compared to results of n=80 humans on the same stimuli. We find that humans are overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but that this is due to ChatGPT-4 outperforming humans only in one task condition, namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer, respectively). Thus, while increased model size may lead to better performance, LLMs are still not sensitive to (un)grammaticality the same way as humans are. It seems possible but unlikely that scaling alone can fix this issue. We interpret these results by comparing language learning in vivo and in silico, identifying three critical differences concerning (i) the type of evidence, (ii) the poverty of the stimulus, and (iii) the occurrence of semantic hallucinations due to impenetrable linguistic reference.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper explores the differences between large language models (LLMs) and humans in handling natural language tasks, and investigates whether these differences can be mitigated by increasing model scale. Specifically, the authors tested three LLMs of different sizes (Bard, ChatGPT-3.5, and ChatGPT-4) on a grammaticality judgment task covering anaphora, center embedding, comparatives, and negative polarity items. By comparing the models' performance with that of humans, the authors aim to answer the following questions:

1. **Accuracy**: Does increasing the number of model parameters improve accuracy in grammaticality judgments?
2. **Stability**: Are the model's responses more stable when the same prompt is presented repeatedly?
3. **Repetition Effect**: Do the model's accuracy and stability improve when the same prompt is presented multiple times?

### Main Findings

1. **Accuracy**:
   - All LLMs performed better when judging grammatical sentences.
   - Bard and ChatGPT-3.5 performed comparably.
   - ChatGPT-4 performed best among all models, especially on grammatical sentences, with an accuracy rate of 93.5%.
   - On ungrammatical sentences, ChatGPT-4 also outperformed the other models, but it still made some errors.
2. **Stability**:
   - All LLMs were more stable when judging grammatical sentences.
   - Bard and ChatGPT-3.5 performed comparably.
   - ChatGPT-4 was the most stable of the models, but in some cases its response variability was greater than that of humans.
3. **Repetition Effect**:
   - Bard improved on grammatical sentences when the same prompt was presented multiple times.
   - ChatGPT-3.5's performance declined when the same prompt was presented multiple times.
   - ChatGPT-4 improved on grammatical sentences when the same prompt was presented multiple times, but its performance on ungrammatical sentences declined slightly.

### Conclusion

Although increasing model scale can significantly improve performance on grammaticality judgments, these models still do not fully reach human levels, especially on ungrammatical sentences. In addition, the models' responses are less stable than those of humans when the same prompt is presented multiple times. Merely increasing model scale may therefore not be sufficient to resolve these differences. The authors suggest this may be because the models' learning process lacks key elements that are present in human language learning, such as sensory information, pragmatic functions, and communicative intentions.
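To make the three measures concrete, here is a minimal Python sketch of how accuracy, answer oscillation, and per-repetition accuracy could be computed from repeated judgments. The record layout, the function name `score_judgments`, and the definition of oscillation (a response that differs from the previous response to the same sentence) are assumptions made for illustration; the paper's own scoring procedure may differ, and the data are toy values.

```python
from collections import defaultdict


def score_judgments(records):
    """Score repeated grammaticality judgments.

    records: iterable of (item_id, is_grammatical, trial_index, judged_grammatical).
    Returns (accuracy, oscillation_rate, accuracy_by_trial).
    Oscillation is ASSUMED here to mean: a response that differs from the
    immediately preceding response to the same item.
    """
    by_item = defaultdict(list)
    for item_id, gold, trial, judged in records:
        by_item[item_id].append((trial, gold, judged))

    total = correct = 0
    transitions = switches = 0
    per_trial = defaultdict(lambda: [0, 0])  # trial -> [correct, total]

    for trials in by_item.values():
        trials.sort()  # order repetitions by trial index
        previous = None
        for trial, gold, judged in trials:
            hit = judged == gold
            total += 1
            correct += hit
            per_trial[trial][0] += hit
            per_trial[trial][1] += 1
            if previous is not None:
                transitions += 1
                switches += judged != previous
            previous = judged

    accuracy = correct / total if total else 0.0
    oscillation = switches / transitions if transitions else 0.0
    accuracy_by_trial = {t: c / n for t, (c, n) in sorted(per_trial.items())}
    return accuracy, oscillation, accuracy_by_trial


# Toy data (hypothetical): two sentences, three repetitions each.
toy_records = [
    ("s1", True, 0, True), ("s1", True, 1, True), ("s1", True, 2, False),
    ("s2", False, 0, True), ("s2", False, 1, False), ("s2", False, 2, False),
]
accuracy, oscillation, accuracy_by_trial = score_judgments(toy_records)
```

Splitting `toy_records` by the `is_grammatical` flag before calling the function would give the per-condition figures (grammatical vs. ungrammatical) that the findings above distinguish.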

Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans
