Emergence of a High-Dimensional Abstraction Phase in Language Transformers

Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, Marco Baroni
2024-05-24
Abstract: A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.
Computation and Language
What problem does this paper attempt to address?
This paper studies how the intrinsic dimension (ID) of representations evolves across the layers of pre-trained transformer LMs, focusing on a high-dimensional abstraction phase. The study reports several characteristics of this phase: 1. Representations in this phase are the first to form a complete linguistic abstraction of the input. 2. They are the first to transfer effectively to downstream tasks. 3. Representations at the ID peak of different LMs predict each other, whereas representations from earlier and later layers do not. 4. An earlier onset of the high-dimensional phase correlates with better language modelling performance.

The authors analyze multiple pre-trained models and input datasets and find an ID peak in the intermediate layers; the peak is reduced or absent for random text and for untrained models. They also report that the ID peak is closely related to performance on syntactic and semantic probing tasks and on downstream NLP tasks. All analyzed transformer architectures develop high-dimensional representations in intermediate layers that encode abstract linguistic information. This information is stored in the representations and may then be used to predict the next token through step-wise refinement.

As background, the paper notes that the internal workings of modern LMs remain opaque, and that compression is considered crucial for learning representations that generalize: tasks and datasets with low ID are typically easier to learn, and prior work has examined the ID of both LM parameters and activation space. The researchers use the nonlinear ID estimator GRIDE to estimate the ID of each layer's representation manifold, with the aim of relating layer geometry to layer function. In summary, the results suggest that core language processing in many common LM architectures rests on a central high-dimensional phase.
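
To make the idea of estimating a layer's ID more concrete, here is a minimal sketch of a nearest-neighbor ID estimator. It is not the paper's GRIDE pipeline: it implements the simpler TwoNN estimator, which uses only the ratio of each point's second to first nearest-neighbor distance, and which GRIDE generalizes to neighbors of higher order. The function name `twonn_id` and the synthetic data are assumptions made for illustration; in practice `X` would hold one layer's representations, one row per input sequence.

```python
import numpy as np

def twonn_id(X):
    """Maximum-likelihood TwoNN estimate of intrinsic dimension.

    Illustrative sketch only (hypothetical helper, not the authors' code).
    X: array of shape (n_points, n_features).
    """
    X = np.asarray(X, dtype=float)
    # Squared pairwise Euclidean distances via the Gram matrix.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)     # exclude each point's distance to itself
    # Two smallest squared distances per row: first and second neighbor.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    keep = r1 > 0                    # drop exact duplicates (ratio undefined)
    mu = r2[keep] / r1[keep]
    # Under the TwoNN model mu is Pareto-distributed with shape d,
    # so the maximum-likelihood estimate is d = N / sum(log(mu)).
    return keep.sum() / np.sum(np.log(mu))

if __name__ == "__main__":
    # Synthetic check: a 2-D Gaussian cloud embedded isometrically in 10-D.
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(1000, 2))
    Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # orthonormal columns
    X = Z @ Q.T
    print(f"estimated ID: {twonn_id(X):.2f}")
```

Applying such an estimator to the representations of each layer in turn gives an ID-versus-depth profile, which is the kind of curve in which the peak described above would appear. The dense distance matrix keeps the sketch short but scales quadratically with the number of points, so a nearest-neighbor index would be preferable for large samples.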