Abstract: A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.

What problem does this paper attempt to address?

This paper studies how intrinsic dimension (ID) evolves across the layers of pre-trained Transformer language models (LMs), focusing on a high-dimensional abstraction phase. The study identifies several characteristics of this phase: (1) its representations are the first to form a full linguistic abstraction of the input; (2) they are the first to transfer viably to downstream tasks; (3) the highest-dimensional representations of different LMs predict each other, whereas representations from earlier and later layers do not; (4) an earlier onset of the phase is associated with better language modeling performance.

The authors analyze five pre-trained models and three input datasets and find an ID peak in the intermediate processing layers; the peak is reduced or absent for random text and for untrained models. They also report that the ID peak is closely tied to performance on grammatical and semantic probing tasks and on downstream NLP tasks. All analyzed Transformer architectures develop high-dimensional representations in intermediate layers that encode complex, abstract linguistic information. The outputs of this processing are stored in the representations and may then be used to predict the next token through step-wise refinement.

As background, the paper notes that although the inner workings of modern LMs are opaque, compression is considered crucial for learning representations that generalize. Tasks and datasets with low ID are typically easier to learn, and prior work has examined the ID of both LM parameters and activation spaces. To relate layer geometry to layer function, the researchers estimate the ID of each layer's representation manifold with GRIDE, a nonlinear ID estimator. In summary, the results suggest that core linguistic processing in many common LM architectures rests on a central high-dimensional phase.
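To make the idea of a per-layer ID estimate concrete, here is a minimal sketch in Python. It does not implement GRIDE; it uses the simpler TwoNN estimator, which relies only on the ratio of each point's second to first nearest-neighbor distance, as a stand-in. The function names `twonn_id` and `layer_id_profile` are hypothetical, and the "layers" in the demo are synthetic arrays rather than real LM activations.

```python
import numpy as np


def twonn_id(points):
    """Estimate intrinsic dimension with the TwoNN maximum-likelihood formula.

    points: array of shape (n_points, n_features).
    Assumption: this is a simplified stand-in for GRIDE, not the paper's code.
    """
    x = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances via the Gram-matrix identity.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)     # exclude each point's distance to itself
    # Two smallest distances per row: first and second nearest neighbors.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                    # drop exact duplicates
    mu = r2[keep] / r1[keep]
    # MLE for the Pareto-distributed ratio mu: d = N / sum(log(mu)).
    return keep.sum() / np.sum(np.log(mu))


def layer_id_profile(layer_activations):
    """Return one ID estimate per layer (hypothetical helper).

    layer_activations: iterable of (n_points, hidden_size) arrays.
    """
    return [float(twonn_id(acts)) for acts in layer_activations]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for layers: 2-D and 5-D data embedded in 50 dims.
    low = rng.uniform(size=(1000, 2)) @ rng.normal(size=(2, 50))
    high = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 50))
    for i, est in enumerate(layer_id_profile([low, high])):
        print(f"layer {i}: estimated ID = {est:.2f}")
```

With real models, each array would hold one hidden-state vector per input sequence at a given layer, and the position of the maximum in the resulting profile would mark the ID peak discussed above. The dense distance matrix limits this sketch to a few thousand points per layer.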

Emergence of a High-Dimensional Abstraction Phase in Language Transformers
