Abstract: This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM, and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, owing to the consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models, we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks such as TinyStories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
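To make the linearity measure in the abstract concrete, here is a minimal NumPy sketch of a Procrustes-style linearity score between the embeddings of two consecutive layers. The normalization (centering, then scaling to unit Frobenius norm) and the names `normalize` and `linearity_score` are assumptions made for illustration, not necessarily the authors' exact definition. The synthetic data mimics a block whose output norm is small relative to the residual stream, the setting the abstract describes.

```python
import numpy as np

def normalize(x):
    """Center each feature, then scale the matrix to unit Frobenius norm."""
    x = x - x.mean(axis=0, keepdims=True)
    return x / np.linalg.norm(x)

def linearity_score(x, y):
    """Return 1 - min_A ||X~ A - Y~||_F^2 for normalized X~, Y~.

    x, y: arrays of shape (n_tokens, dim) holding the embeddings that
    enter and leave a layer. The score lies in [0, 1]; 1 means y is an
    exact linear function of x (after centering).
    """
    xn, yn = normalize(x), normalize(y)
    a, *_ = np.linalg.lstsq(xn, yn, rcond=None)  # best linear map xn -> yn
    residual = xn @ a - yn
    return 1.0 - float(np.sum(residual ** 2))

# Synthetic stand-in for one decoder layer (all values are hypothetical).
rng = np.random.default_rng(0)
n_tokens, dim = 2000, 16
x = rng.normal(size=(n_tokens, dim))
w = rng.normal(size=(dim, dim))
block_out = 0.1 * np.tanh(x @ w)  # low-norm, nonlinear block output
y = x + block_out                 # next-layer embeddings, residual included

score_with_residual = linearity_score(x, y)             # close to 1
score_without_residual = linearity_score(x, block_out)  # noticeably lower
```

In this toy setting the residual stream dominates `y`, so the mapping from `x` to `y` looks almost perfectly linear, while the block output alone is far less linear. This matches the qualitative claim in the abstract; it is not a reproduction of the reported 0.99 figure.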

What problem does this paper attempt to address?

This paper explores the linear properties of Transformer decoders, particularly in models such as GPT, LLaMA, OPT, and BLOOM. It finds an almost perfectly linear relationship (Procrustes similarity up to 0.99) between the embeddings of consecutive layers in these models. However, the linearity decreases when the residual component is removed, because the output norm of each Transformer layer is consistently low. The authors demonstrate experimentally that removing or linearly approximating the most linear blocks has little impact on the model's loss and performance. They also propose a regularization method based on cosine similarity, applied when pretraining smaller models, to reduce layer linearity; it improves performance metrics on benchmarks such as TinyStories and SuperGLUE and successfully reduces the linearity of the models.

The novel contributions presented in the paper include:
1. Analyzing the linearity properties and dynamics of the Transformer decoder during the pretraining and fine-tuning stages.
2. Developing a depth-pruning algorithm that removes the most linear layers without significant performance impact.
3. Introducing a novel distillation technique that involves pruning, replacing certain layers with linear approximations, and then distilling the layer embeddings to maintain model performance.
4. Introducing a pretraining regularization method based on cosine similarity that aims to reduce layer linearity and enhance model performance on benchmark tasks.

These findings challenge the traditional understanding of the Transformer architecture, suggesting that its operations may be more linear than previously thought. The paper thereby opens up new avenues for building more efficient and effective Transformer architectures, addressing one of the key challenges in deploying these models.
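The cosine-similarity regularization mentioned above can be sketched as an extra loss term computed from the embeddings of consecutive layers. The summary does not give the formula, so the form used here, `lam * sum(1 - cos)` over consecutive layer pairs, is an assumption, as are the weight `lam` and the function names. In real training this term would be written in an autodiff framework and added to the language-modeling loss; NumPy is used only to show the arithmetic.

```python
import numpy as np

def mean_cosine(a, b, eps=1e-8):
    """Mean token-wise cosine similarity between two (n_tokens, dim) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.mean(num / den))

def cosine_regularizer(layer_embeddings, lam=0.1):
    """Assumed form: lam * sum over consecutive layers of (1 - mean cosine).

    layer_embeddings: list of (n_tokens, dim) arrays, one per layer.
    """
    pairs = zip(layer_embeddings[:-1], layer_embeddings[1:])
    return lam * sum(1.0 - mean_cosine(prev, nxt) for prev, nxt in pairs)

# Toy check on hypothetical embeddings.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(8, 4))
identical = [h0, h0.copy(), h0.copy()]  # consecutive layers agree: penalty near 0
flipped = [h0, -h0, h0]                 # consecutive layers oppose: penalty near 4 * lam

task_loss = 2.5  # hypothetical language-modeling loss value
total_loss = task_loss + cosine_regularizer(identical)
```

The term depends only on the per-layer hidden states, so it can be computed from the same activations already produced in the forward pass.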

Your Transformer is Secretly Linear

Linear attention is (maybe) all you need (to understand transformer optimization)

Linear Transformers are Versatile In-Context Learners

The Devil in Linear Transformer

Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

Improving Systematic Generalization of Linear Transformer Using Normalization Layers and Orthogonality Loss Function

Representational Strengths and Limitations of Transformers

Transformers need glasses! Information over-squashing in language tasks

Local Interpretation of Transformer Based on Linear Decomposition

From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport

Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context

Trained Transformers Learn Linear Models In-Context

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling

Understanding the Difficulty of Training Transformers

Theoretical limitations of multi-layer Transformer

The geometry of hidden representations of large transformer models

Graph Transformers Dream of Electric Flow

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Linearizing Transformer with Key-Value Memory