Abstract: Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, for lack of widely acknowledged testing mechanisms, the question of whether LLMs are stochastic parrots or genuinely comprehend the world remains unresolved, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU and neglects fine-grained explorations. However, such explorations are crucial for understanding the models' unique comprehension mechanisms, aligning them with human cognition, and ultimately enhancing LLMs' general NLU capacities. To address this gap, our study delves into LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Evaluating open-source and closed-source models of varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are introduced to help alleviate this problem, yet limitations persist. By highlighting these critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.

Large Language Models Lack Understanding of Character Composition of Words

Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?

CUTE: Measuring LLMs' Understanding of Their Tokens

LLMs' Understanding of Natural Language Revealed

Can large language models understand uncommon meanings of common words?

How Well Do Large Language Models Understand Syntax? An Evaluation by Asking Natural Language Questions

Large Language Models aren't all that you need

Large Language Models Are In-Context Semantic Reasoners Rather Than Symbolic Reasoners

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Interpreting token compositionality in LLMs: A robustness analysis

Can Large Language Models Identify Authorship?

Exploring the Limitations of Large Language Models in Compositional Relation Reasoning

Evaluating Morphological Compositional Generalization in Large Language Models

The Importance of Understanding Language in Large Language Models

Testing AI on language comprehension tasks reveals insensitivity to underlying meaning

From Words to Worlds: Compositionality for Cognitive Architectures

Large language models effectively leverage document-level context for literary translation, but critical errors persist

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

Large Language Models Are Not Strong Abstract Reasoners

Large Language Models: A Survey