Abstract: Large transformers are powerful architectures used for self-supervised data analysis across various data types, including protein sequences, images, and text. In these models, the semantic structure of the dataset emerges from a sequence of transformations between one representation and the next. We characterize the geometric and statistical properties of these representations and how they change as we move through the layers. By analyzing the intrinsic dimension (ID) and neighbor composition, we find that the representations evolve similarly in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic information of the dataset is better expressed at the end of the first peak, and this phenomenon can be observed across many models trained on diverse datasets. Based on our findings, we point out an explicit strategy to identify, without supervision, the layers that maximize semantic content: representations at intermediate layers corresponding to a relative minimum of the ID profile are more suitable for downstream learning tasks.

What problem does this paper attempt to address?

The core problem this paper addresses is understanding how the geometric and statistical properties of hidden representations in large Transformer models evolve across layers when the models process different data types (such as protein sequences, images, and text). Specifically, the authors focus on the following issues:

1. **Geometric properties of hidden representations**: How the representations evolve from layer to layer, as measured by changes in intrinsic dimension (ID) and neighborhood structure.
2. **Extraction of semantic information**: Which layers' representations best capture the semantic information of the dataset when the model is trained with self-supervision.
3. **Internal mechanisms of the model**: How Transformer models build meaningful representations through three stages of data expansion, compression, and decoding when performing self-supervised tasks.

### Main research content

- **Evolution of intrinsic dimension (ID)**:
  - In the early layers, the data manifold expands rapidly into a high-dimensional space; it then contracts significantly in the middle layers, reaching a low-dimensional space. In the final part of the model, the ID remains relatively stable or forms a second shallow peak.
  - This pattern is consistent across Transformer models trained on protein language tasks and on image reconstruction tasks.
- **Distribution of semantic information**:
  - Semantic information is most abundant in the middle layers after the first peak. For example, in protein language models, information about remote homology relationships reaches its maximum during the plateau of the ID curve.
  - For image Transformer models, semantic features (such as class labels) are most evident in the layers where the ID reaches a relative minimum.
- **Unsupervised strategy**:
  - An unsupervised method is proposed to identify the layers carrying the most semantic information: select the middle layers corresponding to a relative minimum of the ID curve. These layers are more suitable as feature extractors for downstream learning tasks.

### Conclusion

By studying the geometric properties and the distribution of semantic information in large Transformer models, the authors show how these models gradually construct abstract and meaningful data representations during self-supervised learning. In particular, they propose an unsupervised method to locate the layers with the richest semantic representations, which is useful for improving performance on downstream tasks.

### Formula summary

- **Calculation of intrinsic dimension (ID)**: The intrinsic dimension can be estimated with the TwoNN estimator, which is based on the ratio
  \[ \mu_i = \frac{r_{i2}}{r_{i1}} \]
  where \(r_{i1}\) and \(r_{i2}\) are the distances from point \(x_i\) to its first and second nearest neighbors, respectively. Under the assumption of locally constant density, \(\mu_i\) follows a Pareto distribution whose shape parameter equals the ID.
- **Neighborhood overlap**: Neighborhood overlap measures the similarity between two representations and is defined as
  \[ \chi_{l,m}^k = \frac{1}{N}\sum_{i = 1}^{N}\frac{1}{k}\sum_{j = 1}^{N}A_{ij}^l A_{ij}^m \]
  where \(A_{ij}^l\) is the adjacency matrix of the \(l\)-th layer: \(A_{ij}^l = 1\) if \(x_j\) is among the \(k\) nearest neighbors of \(x_i\) in layer \(l\), and \(0\) otherwise. The quantity is therefore the fraction of the \(k\) nearest neighbors that a point shares between layers \(l\) and \(m\), averaged over all \(N\) points.

These methods and conclusions help explain the working mechanisms of Transformer models and provide guidance for applying them.
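The two quantities in the formula summary can be illustrated with a minimal pure-Python sketch. Several choices here are assumptions for illustration and not the authors' implementation: the function names (`twonn_id`, `knn_indices`, `neighborhood_overlap`) are hypothetical, neighbors are found by brute force in O(N²) time, and the ID is obtained from the maximum-likelihood estimate of the Pareto shape parameter, \(d = N / \sum_i \log \mu_i\), which is a simple variant and may differ from the fitting procedure used in the paper.

```python
import heapq
import math
import random


def twonn_id(points):
    """TwoNN-style ID estimate via the Pareto maximum-likelihood formula.

    Assumption: uses d = N / sum(log mu_i) with mu_i = r_i2 / r_i1.
    """
    log_mu = []
    for i, p in enumerate(points):
        # Distances from point i to every other point; keep the two smallest.
        dists = (math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = heapq.nsmallest(2, dists)
        if r1 > 0.0:  # skip exact duplicates, where mu_i is undefined
            log_mu.append(math.log(r2 / r1))
    return len(log_mu) / sum(log_mu)


def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point (brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        neighbors.append([j for _, j in heapq.nsmallest(k, dists)])
    return neighbors


def neighborhood_overlap(points_l, points_m, k):
    """Average fraction of k nearest neighbors shared by two representations.

    points_l[i] and points_m[i] must describe the same data point i
    in two different layers (or, more generally, two representations).
    """
    nn_l = knn_indices(points_l, k)
    nn_m = knn_indices(points_m, k)
    shared = sum(len(set(a) & set(b)) for a, b in zip(nn_l, nn_m))
    return shared / (k * len(points_l))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: a 2-D sheet linearly embedded in 5 dimensions.
    sheet = [(rng.random(), rng.random()) for _ in range(200)]
    layer_a = [(u, v, u + v, u - v, 0.5 * u) for u, v in sheet]
    # A second "layer" that is unrelated to the first.
    layer_b = [(rng.random(), rng.random()) for _ in range(200)]

    print("estimated ID of layer_a:", twonn_id(layer_a))
    print("overlap(layer_a, layer_a):", neighborhood_overlap(layer_a, layer_a, 10))
    print("overlap(layer_a, layer_b):", neighborhood_overlap(layer_a, layer_b, 10))
```

In practice one would apply such functions to the hidden representations extracted at each layer, which gives an ID profile over depth and an overlap value between any pair of layers. For realistic dataset sizes, a vectorized or tree-based nearest-neighbor search would likely be preferable to the brute-force loops used here.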

The geometry of hidden representations of large transformer models

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling

Transformers represent belief state geometry in their residual stream

Representational Strengths and Limitations of Transformers

Emergence of a High-Dimensional Abstraction Phase in Language Transformers

A mathematical perspective on Transformers

Unleashing the Power of Transformer for Graphs

Representations as Language: An Information-Theoretic Framework for Interpretability

Transformer Layers as Painters

Why do universal adversarial attacks work on large language models?: Geometry might be the answer

Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data

The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

Understanding Video Transformers via Universal Concept Discovery

Intriguing Equivalence Structures of the Embedding Space of Vision Transformers

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Unveiling Transformer Perception by Exploring Input Manifolds

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

Deep Transformers with Latent Depth

Do Transformers Really Perform Bad for Graph Representation?

VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers

Separations in the Representational Capabilities of Transformers and Recurrent Architectures