Your Transformer is Secretly Linear

Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov
2024-05-20
Abstract: This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed due to a consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper examines linearity in transformer decoders such as GPT, LLaMA, OPT, and BLOOM. The authors find a near-perfect linear relationship between the embeddings of sequential layers (Procrustes similarity score of 0.99). The linearity decreases when the residual component is removed, which the paper attributes to the consistently low output norm of the transformer layer. Experiments show that removing or linearly approximating some of the most linear blocks does not significantly affect the model's loss or performance. The authors also introduce a cosine-similarity-based regularization, applied when pretraining smaller models, that aims to reduce layer linearity. It improves metrics on benchmarks such as TinyStories and SuperGLUE and decreases the linearity of the resulting models.

The contributions presented in the paper include:
1. Analyzing the linearity properties of transformer decoders and how they change during the pretraining and fine-tuning stages.
2. Developing a depth pruning algorithm that removes the most linear layers without significant performance impact.
3. Introducing a distillation technique that prunes the model, replaces certain layers with linear approximations, and then distills the layer embeddings to maintain model performance.
4. Introducing a pretraining regularization method based on cosine similarity that aims to reduce layer linearity and improve performance on benchmark tasks.

These findings challenge the traditional understanding of the transformer architecture, suggesting that its operation may be more linear than previously thought. The pruning and linear-approximation results also point toward more efficient transformer architectures, which matters for the cost of deploying these models.
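The text above names a Procrustes similarity score but does not give its formula. The following is a minimal sketch of one way such a layer-wise linearity score could be computed. It assumes the score is one minus the smallest squared Frobenius residual of a least-squares linear map between centered, norm-scaled embeddings. The function name `linearity_score`, the toy data, and the 0.05 scaling of the block output are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np


def linearity_score(x: np.ndarray, y: np.ndarray) -> float:
    """Return 1 - min_A ||x_n @ A - y_n||_F^2 for centered, normalized inputs.

    x, y: arrays of shape (n_tokens, hidden_dim) holding embeddings of the
    same tokens before and after one layer. A score of 1.0 means y is an
    exact linear function of x (up to a constant shift).
    NOTE: this definition is an assumption made for illustration.
    """
    # Center each feature, then scale each set to unit Frobenius norm.
    x_n = x - x.mean(axis=0, keepdims=True)
    y_n = y - y.mean(axis=0, keepdims=True)
    x_n = x_n / np.linalg.norm(x_n)
    y_n = y_n / np.linalg.norm(y_n)

    # Least-squares fit of the best linear map A from x_n to y_n.
    a, *_ = np.linalg.lstsq(x_n, y_n, rcond=None)
    residual = np.linalg.norm(x_n @ a - y_n) ** 2
    return float(1.0 - residual)


rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))      # toy "layer input" embeddings
w = rng.normal(size=(16, 16))       # toy weight matrix

y_linear = x @ w                    # purely linear layer
block_out = 0.05 * np.tanh(x @ w)   # nonlinear block output with a small norm
y_residual = x + block_out          # block output added to the residual stream

score_linear = linearity_score(x, y_linear)
score_block = linearity_score(x, block_out)        # residual removed
score_residual = linearity_score(x, y_residual)    # residual included

print(f"linear layer:        {score_linear:.4f}")
print(f"block output only:   {score_block:.4f}")
print(f"with residual added: {score_residual:.4f}")
```

In this toy setup the score with the residual included is much closer to 1 than the score of the block output alone, because the small-norm block output barely perturbs the stream it is added to. This mirrors the effect described above in spirit only; it does not reproduce any of the paper's measurements.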