MLP Can Be A Good Transformer Learner

Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang
2024-04-09
Abstract: The self-attention mechanism is the key component of the Transformer, but it is often criticized for its computational demands. Previous token pruning works motivate their methods from the perspective of computational redundancy, but they still need to load the full network and incur the same memory cost. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load by selectively removing non-essential attention layers, guided by entropy considerations. We find that, for the attention layers in the bottom blocks, the subsequent MLP layers, i.e. two feed-forward layers, can elicit the same entropy quantity. Meanwhile, these accompanying MLPs are under-exploited, since they exhibit smaller feature entropy than the MLPs in the top blocks. Therefore, we propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into identical mappings, leaving only the MLP in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and memory bound without compromising performance. Code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the efficiency of the self-attention mechanism in Transformer models, which is often criticized for its computational cost. The authors find that the attention layers in some bottom blocks carry less information than those in the top blocks, and that these low-entropy attention layers are followed by multi-layer perceptron (MLP) layers that can elicit the same entropy quantity. The paper therefore proposes to remove the unnecessary attention layers directly, using entropy as the criterion, in order to reduce computational load and memory usage without compromising performance. Specifically, it proposes NOSE, an entropy-based selection strategy, to identify which attention layers can be integrated into their subsequent MLP layers by degenerating those attention layers into identity mappings. Experimental results show that the method can remove 40% of the attention layers in DeiT-B without reducing performance, while improving throughput and memory bound. The paper also compares NOSE against random selection of attention layers and reports that NOSE reduces the number of attention layers more effectively while maintaining performance. Experiments on the ImageNet-1k, CIFAR-100, and ADE20k datasets validate the method, especially in terms of memory efficiency and throughput. In summary, the paper proposes a framework that transfers the knowledge of non-critical attention layers to MLP layers and guides the choice of attention layers through entropy quantification and transfer entropy, thereby simplifying the vision Transformer and improving its computational efficiency.
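To make the idea of entropy-guided layer selection concrete, here is a minimal sketch in plain Python. It is not the paper's NOSE procedure, which the summary says also relies on transfer entropy. It only ranks layers by a feature-entropy estimate and picks the lowest 40%. The Gaussian assumption behind the entropy formula, the synthetic per-layer features, and the helper names `gaussian_entropy` and `pick_low_entropy_layers` are all assumptions made for illustration.

```python
import math
import random
import statistics


def gaussian_entropy(features):
    """Differential entropy (in nats) of a 1-D sample, assuming it is Gaussian.

    For a Gaussian with variance s^2 the entropy is 0.5 * log(2 * pi * e * s^2).
    The sample must have non-zero variance.
    """
    variance = statistics.pvariance(features)
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def pick_low_entropy_layers(layer_features, fraction):
    """Return (indices of the lowest-entropy layers, all per-layer entropies)."""
    entropies = [gaussian_entropy(f) for f in layer_features]
    count = int(len(layer_features) * fraction)
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i])
    return sorted(ranked[:count]), entropies


# Hypothetical data: 10 layers whose feature spread grows with depth,
# loosely mimicking "bottom blocks carry less information than top blocks".
rng = random.Random(0)
layer_features = [
    [rng.gauss(0.0, 0.5 + 0.3 * depth) for _ in range(2000)]
    for depth in range(10)
]

selected, entropies = pick_low_entropy_layers(layer_features, fraction=0.4)
print("candidate layers for attention removal:", selected)
```

In a real model, the selected indices would mark the blocks whose attention layer is a candidate for being replaced by an identity mapping, so that only the MLP remains in those blocks. Whether accuracy is preserved has to be checked empirically, as the paper does on ImageNet-1k.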