MLP Can Be A Good Transformer Learner

Sihao Lin,Pumeng Lyu,Dongrui Liu,Tao Tang,Xiaodan Liang,Andy Song,Xiaojun Chang

2024-04-09

Abstract: The self-attention mechanism is the key component of the Transformer, but it is often criticized for its computational demands. Previous token pruning works motivate their methods from the view of computational redundancy, but they still need to load the full network and incur the same memory costs. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers, guided by entropy considerations. We identify that, for the attention layers in the bottom blocks, their subsequent MLP layers, i.e., two feed-forward layers, can elicit the same entropy quantity. Meanwhile, the accompanying MLPs are under-exploited, since they exhibit smaller feature entropy than the MLPs in the top blocks. Therefore, we propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into identity mappings, yielding only an MLP in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and memory bound without compromising performance. Code is available at
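To make the idea of "degenerating an attention layer into an identity mapping" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a simplified residual block (single-head attention, ReLU instead of GELU, no layer normalization), and the `attn_scale` knob is a hypothetical parameter introduced here for illustration only. When `attn_scale` is 0, the attention sub-layer contributes nothing and the block reduces to a residual MLP.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv):
    """Single-head self-attention over tokens x of shape (n_tokens, dim)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return scores @ v

def mlp(x, w1, w2):
    """Two feed-forward layers with a ReLU in between (an assumption; ViTs use GELU)."""
    return np.maximum(x @ w1, 0.0) @ w2

def block(x, params, attn_scale=1.0):
    """Simplified residual block; attn_scale is a hypothetical knob in [0, 1].

    attn_scale=1.0: ordinary attention + MLP block.
    attn_scale=0.0: the attention sub-layer is an identity mapping,
                    so only the residual MLP remains and attention is skipped.
    """
    if attn_scale > 0.0:
        x = x + attn_scale * attention(x, *params["attn"])
    return x + mlp(x, *params["mlp"])

# Toy weights and input, purely illustrative.
rng = np.random.default_rng(0)
dim, hidden, n_tokens = 8, 32, 4
params = {
    "attn": [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)],
    "mlp": [rng.normal(size=(dim, hidden)) * 0.1,
            rng.normal(size=(hidden, dim)) * 0.1],
}
x = rng.normal(size=(n_tokens, dim))

full = block(x, params, attn_scale=1.0)      # attention + MLP
mlp_only = block(x, params, attn_scale=0.0)  # attention degenerated to identity
```

In this sketch the MLP-only path never evaluates the attention weights, which is the source of the throughput and memory savings the abstract describes; how the attention layer is driven toward identity during training is not specified in the text above and is left out here.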

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the inefficiency of the self-attention mechanism in Transformer models, which is often criticized for its computational cost. The authors observe that the attention layers in the bottom blocks carry less information than those in the top blocks, and that these low-entropy attention layers are followed by multi-layer perceptron (MLP) layers that can elicit the same entropy quantity. Based on this observation, the paper proposes removing unnecessary attention layers directly, selected from an entropy perspective, to reduce computational load and memory usage without compromising performance.

Specifically, the paper proposes NOSE, an entropy-based selection strategy, to identify which attention layers can be integrated into their subsequent MLP layers by degenerating them into identity mappings. Experimental results show that this method can remove 40% of the attention layers in DeiT-B without reducing performance, while improving throughput and memory bound. The paper also compares random selection of attention layers against NOSE, showing that NOSE reduces the number of attention layers more effectively while maintaining performance. Experiments on ImageNet-1k, CIFAR-100, and ADE20K validate the method, especially in terms of memory efficiency and throughput.

In conclusion, the paper proposes a framework that transfers the knowledge of non-critical attention layers to MLP layers and guides the selection of those attention layers through entropy quantification and transfer entropy, thereby simplifying the vision Transformer and improving its computational efficiency.
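The entropy-guided selection described above can be illustrated with a simplified stand-in. The sketch below is not NOSE itself: it ignores transfer entropy, assumes each layer's features are summarized as a flat list of scalars, and assumes a Gaussian fit so that the differential entropy is 0.5 * ln(2 * pi * e * variance). The function names and the toy data are hypothetical.

```python
import math
import random
from statistics import pvariance

def gaussian_entropy(values):
    """Differential entropy (in nats) of a 1-D Gaussian fitted to `values`."""
    var = pvariance(values)
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def select_low_entropy_layers(features_per_layer, ratio):
    """Pick the fraction `ratio` of layers with the lowest feature entropy.

    features_per_layer: list of per-layer feature samples (lists of floats).
    Returns (sorted indices of the selected layers, entropy of every layer).
    """
    n_select = int(len(features_per_layer) * ratio)
    entropies = [gaussian_entropy(f) for f in features_per_layer]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i])
    return sorted(ranked[:n_select]), entropies

# Toy data (an assumption): five "layers" whose feature spread grows with depth,
# mimicking the observation that bottom blocks carry less information.
random.seed(0)
features = [
    [random.gauss(0.0, 0.5 + 0.5 * depth) for _ in range(2000)]
    for depth in range(5)
]

# Select 40% of the layers as removal candidates.
selected, entropies = select_low_entropy_layers(features, ratio=0.4)
```

On this toy data the candidates are the shallowest layers, since their features have the smallest variance. A faithful implementation would also need the transfer-entropy term mentioned in the summary, whose exact form is not given here.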

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Pay Attention to MLPs

Less is More: Pay Less Attention in Vision Transformers

What Matters in Transformers? Not All Attention is Needed

Attention-Only Transformers and Implementing MLPs with Attention Heads

Reducing the Transformer Architecture to a Minimum

An Attention-Based Token Pruning Method for Vision Transformers

Lightweight transformer image feature extraction network

Transformer with sparse self‐attention mechanism for image captioning

Demystify Transformers & Convolutions in Modern Image Deep Networks

NiNformer: A Network in Network Transformer with Token Mixing as a Gating Function Generator

Lite Vision Transformer with Enhanced Self-Attention

Attention is All you Need

Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking

Representational Strengths and Limitations of Transformers

Adder Attention for Vision Transformer

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems