Abstract: The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
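As a rough illustration of what checking consistency across senses could look like, the sketch below compares a model's answers to the same items presented in different senses (for example, different languages or paraphrases) and reports the fraction of items on which each pair of senses agrees. The abstract does not specify the metric used in the paper, so the function name `pairwise_consistency`, the exact-match agreement criterion, and the toy answers are all assumptions made for illustration only.

```python
from itertools import combinations


def pairwise_consistency(answers_by_sense):
    """Return the agreement rate for every pair of senses.

    answers_by_sense maps a sense label (hypothetical, e.g. "en" or
    "paraphrase") to a list of answers, aligned so that position i in
    every list refers to the same underlying item.
    Agreement here is plain exact match, which is an assumption.
    """
    senses = sorted(answers_by_sense)
    if not senses:
        raise ValueError("at least one sense is required")
    n_items = len(answers_by_sense[senses[0]])
    if n_items == 0:
        raise ValueError("at least one item is required")
    if any(len(answers_by_sense[s]) != n_items for s in senses):
        raise ValueError("all senses must cover the same items")

    scores = {}
    for a, b in combinations(senses, 2):
        # Count the items on which the two senses give the same answer.
        agree = sum(
            x == y for x, y in zip(answers_by_sense[a], answers_by_sense[b])
        )
        scores[(a, b)] = agree / n_items
    return scores


# Toy, made-up answers to four multiple-choice items in three languages.
toy_answers = {
    "en": ["A", "B", "C", "A"],
    "de": ["A", "B", "B", "A"],
    "nl": ["A", "C", "C", "A"],
}
toy_scores = pairwise_consistency(toy_answers)
```

Note that this measures agreement between senses, not correctness: a model can be perfectly consistent while being wrong on every item, which is why a consistency score of this kind is best read alongside task accuracy.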

Measuring and Improving Consistency in Pretrained Language Models

Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models

Factual Consistency of Multilingual Pretrained Language Models

Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary

From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference

How often are errors in natural language reasoning due to paraphrastic variability?

Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models

Semantic Consistency for Assuring Reliability of Large Language Models

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models

A Comprehensive Evaluation of Semantic Relation Knowledge of Pretrained Language Models and Humans

Towards Consistent Language Models Using Declarative Constraints

On Measuring Faithfulness or Self-consistency of Natural Language Explanations

The Effect of Scaling, Retrieval Augmentation and Form on the Factual Consistency of Language Models

Predicting Question-Answering Performance of Large Language Models through Semantic Consistency

The Queen of England is not England's Queen: On the Lack of Factual Coherency in PLMs

Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

A Close Look into the Calibration of Pre-trained Language Models

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Are Large Language Models Consistent over Value-laden Questions?