Abstract: Vision Transformer (ViT) now dominates many vision tasks. The quadratic complexity of its token-wise multi-head self-attention (MHSA) has been extensively addressed via either token sparsification or dimension reduction (spatial or channel). However, the redundancy within MHSA itself is usually overlooked, as is that of the feed-forward network (FFN). To this end, we propose attention map hallucination and FFN compaction to fill this gap. Specifically, we observe that similar attention maps exist in vanilla ViT and propose to hallucinate half of the attention maps from the rest with much cheaper operations; we call this hallucinated-MHSA (hMHSA). As for the FFN, we factorize its hidden-to-output projection matrix and leverage the re-parameterization technique to strengthen its capability, yielding compact-FFN (cFFN). With our proposed modules, a 10%-20% reduction in floating point operations (FLOPs) and parameters (Params) is achieved for various ViT-based backbones, including straight (DeiT), hybrid (NextViT), and hierarchical (PVT) structures, while performance remains quite competitive.

What problem does this paper attempt to address?

The paper addresses the computational cost of Vision Transformers (ViT): the quadratic complexity of the Multi-Head Self-Attention (MHSA) mechanism and the often overlooked redundancy of the Feed-Forward Network (FFN) module. The authors propose a method with two parts:

1. **Attention Map Hallucination**: Observing that many attention maps in a standard ViT are similar and therefore redundant, they propose hallucinated Multi-Head Self-Attention (hMHSA). Instead of computing every attention map through the expensive Query-Key correlation, hMHSA computes half of the maps that way and hallucinates the other half from them using cheaper operations.

2. **FFN Compaction**: To address the redundancy in the FFN, they design a compact Feed-Forward Network (cFFN). It factorizes the hidden-to-output projection matrix of the FFN and uses re-parameterization to strengthen the capability of the factorized module, reducing computational cost while keeping performance competitive.

With these two modules, the paper reports a reduction of approximately 10%-20% in floating-point operations (FLOPs) and parameter counts (Params) on various ViT-based backbones while maintaining comparable performance. The method applies to different ViT structures, including straight (e.g., DeiT), hybrid (e.g., NextViT), and hierarchical (e.g., PVT) designs. Experimental results on the ImageNet classification task show that models adopting these modules need fewer computational resources while maintaining or even improving accuracy.
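The attention map hallucination idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary does not say which cheap operations hMHSA uses, so a learned linear mix across heads (the hypothetical `head_mix` matrix) followed by a softmax stands in for them here. All function names, weight names, and sizes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hallucinated_attention(x, w_q, w_k, w_v, head_mix, num_heads):
    """Sketch of attention in which only half of the heads use Q-K correlation.

    x:        (n, dim) token features
    w_q, w_k: (dim, dim // 2) projections for the "real" heads only
    w_v:      (dim, dim) value projection for all heads
    head_mix: (num_heads // 2, num_heads // 2) hypothetical mixing weights
    """
    n, dim = x.shape
    head_dim = dim // num_heads
    real = num_heads // 2

    # Queries and keys are computed for the real heads only.
    q = (x @ w_q).reshape(n, real, head_dim).transpose(1, 0, 2)
    k = (x @ w_k).reshape(n, real, head_dim).transpose(1, 0, 2)
    v = (x @ w_v).reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    # Expensive path: Q-K correlation, shape (real, n, n).
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    real_maps = softmax(logits)

    # Assumed cheap path: each hallucinated map is a linear mix of the
    # real heads' logits, renormalized with a softmax.
    ghost_maps = softmax(np.einsum("gh,hij->gij", head_mix, logits))

    attn = np.concatenate([real_maps, ghost_maps], axis=0)  # (num_heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, dim)
    return out, attn

# Toy usage with hypothetical sizes: 6 tokens, 16 channels, 4 heads.
rng = np.random.default_rng(0)
n, dim, num_heads = 6, 16, 4
x = rng.standard_normal((n, dim))
w_q = rng.standard_normal((dim, dim // 2))
w_k = rng.standard_normal((dim, dim // 2))
w_v = rng.standard_normal((dim, dim))
head_mix = rng.standard_normal((num_heads // 2, num_heads // 2))
out, attn = hallucinated_attention(x, w_q, w_k, w_v, head_mix, num_heads)
```

In this sketch, mixing costs about `real * n * n` multiply-adds per hallucinated head, against `head_dim * n * n` for a Query-Key correlation, so it saves work only when the number of real heads is smaller than the head dimension. The query and key projections for the hallucinated heads are skipped as well.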
Additionally, ablation studies validate both components: the attention map hallucination strategy and the FFN compaction strategy. In summary, the paper aims to make Vision Transformers more efficient, especially where computational resources are limited, by removing redundant computation while keeping accuracy competitive.
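The FFN compaction idea summarized above can be sketched in the same spirit. The example below assumes a plain low-rank factorization of the hidden-to-output matrix and one common form of re-parameterization, in which parallel linear branches used during training are merged into a single matrix for inference. The summary does not specify the paper's exact factorization or branch design, so the rank, the sizes, and the names `U_a`, `U_b`, and `V` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, rank = 8, 32, 4  # hypothetical sizes; hidden = 4 * dim

# A vanilla FFN maps hidden -> dim with one dense matrix.
dense_params = hidden * dim

# Assumed factorization: hidden -> rank -> dim.
# Assumed training-time form: two parallel branches for the first factor.
U_a = rng.standard_normal((hidden, rank))
U_b = rng.standard_normal((hidden, rank))
V = rng.standard_normal((rank, dim))

def train_time_projection(h):
    # Both branches are evaluated and summed during training.
    return (h @ U_a + h @ U_b) @ V

# Re-parameterization: the branches are linear, so their weights can be
# added once and the extra branch disappears at inference time.
U_merged = U_a + U_b

def inference_projection(h):
    return (h @ U_merged) @ V

h = rng.standard_normal((5, hidden))  # 5 tokens of hidden activations
train_out = train_time_projection(h)
infer_out = inference_projection(h)

# Parameters kept at inference after merging the branches.
factorized_params = hidden * rank + rank * dim

print(np.allclose(train_out, infer_out))
print(dense_params, factorized_params)
```

With these toy sizes the factorized projection stores 160 weights instead of 256, and the merged inference path matches the two-branch training path up to floating-point error. How much of this saving carries over to a real backbone depends on the rank and branch design actually chosen.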

Vision Transformer with Attention Map Hallucination and FFN Compaction

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Constituent Attention for Vision Transformers

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

FMViT: A multiple-frequency mixing Vision Transformer

FViT: A Focal Vision Transformer with Gabor Filter

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Super Vision Transformer

Vision Transformer with Super Token Sampling

Vicinity Vision Transformer

Vision Transformer with Sparse Scan Prior

MaxViT: Multi-Axis Vision Transformer

Improving Vision Transformers by Revisiting High-Frequency Components

Rethinking Attention Mechanisms in Vision Transformers with Graph Structures

Fast Vision Transformers with HiLo Attention

RegionViT: Regional-to-Local Attention for Vision Transformers

You Only Need Less Attention at Each Stage in Vision Transformers

Fusion of regional and sparse attention in Vision Transformers

FAM: Improving columnar vision transformer with feature attention mechanism