FMViT: A multiple-frequency mixing Vision Transformer

Wei Tan, Yifeng Geng, Xuansong Xie
DOI: https://doi.org/10.48550/arXiv.2311.05707
2023-11-10
Abstract: The transformer model has recently gained widespread adoption in computer vision tasks. However, because the time and memory cost of self-attention grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) struggle to perform efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent attempts have been made to design CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To tackle these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency and low-frequency features at varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5, but with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% (78.5% vs. 75.9%) in top-1 accuracy on the ImageNet dataset, with comparable inference latency. Our code can be found at https://github.com/tany0699/FMViT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve the efficiency and accuracy of Vision Transformer models in real industrial deployment scenarios, especially on platforms such as TensorRT and CoreML. Existing Vision Transformers (ViTs) usually run poorly on these platforms because of the high time and memory complexity of the self-attention mechanism, and they cannot match traditional convolutional neural networks (CNNs) there. Some studies try to solve this problem with CNN-Transformer hybrid architectures, but their overall performance is still unsatisfactory.
To address these challenges, the authors propose an efficient hybrid ViT architecture, FMViT. FMViT enhances the model's expressive ability and reduces computational overhead through the following components:
1. **Multi-Frequency Blending Module (FMB)**: Fuses features of different frequencies (high-frequency and low-frequency features) so that the model captures both local and global information more effectively.
2. **Convolutional Fusion Block (CFB)**: Efficiently incorporates the local modeling ability of convolution, and uses convolutional multi-group re-parameterization to further improve modeling performance.
3. **Convolutional multi-group re-parameterization**: Integrates the spatial information of different sub-channels during training and fuses the branches into a single convolution at inference, improving accuracy while maintaining inference speed.
4. **Lightweight Multi-Head Self-Attention Module (RLMHSA)**: Adopts a lightweight, re-parameterized design to strengthen the module's global modeling ability and speed up inference.
Experimental results show that FMViT outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures on multiple vision tasks in terms of the latency/accuracy trade-off. For example, on the TensorRT platform, FMViT's top-1 accuracy on the ImageNet dataset is 2.5% higher than that of ResNet101 (83.3% vs. 80.8%) at a similar inference latency; on the CoreML platform, its top-1 accuracy is 2.6% higher than that of MobileOne (78.5% vs. 75.9%), with comparable inference latency.
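The re-parameterization idea summarized above (several branches during training, one convolution at inference) rests on the linearity of convolution. The following is a minimal sketch of that general principle, not the authors' implementation: it assumes 1x1 (pointwise) grouped convolutions without bias or normalization, and the helper names `grouped_weight`, `pointwise_conv`, and `add_matrices` are hypothetical. The actual gMLP design in FMViT may differ in kernel size, branch layout, and normalization.

```python
import random
from functools import reduce


def grouped_weight(rng, channels, groups):
    """Dense weight of a 1x1 grouped convolution (hypothetical helper).

    A grouped convolution only mixes channels inside the same group,
    so its dense equivalent is a block-diagonal matrix.
    """
    size = channels // groups
    weight = [[0.0] * channels for _ in range(channels)]
    for g in range(groups):
        for o in range(g * size, (g + 1) * size):
            for c in range(g * size, (g + 1) * size):
                weight[o][c] = rng.uniform(-1.0, 1.0)
    return weight


def pointwise_conv(weight, pixels):
    """Apply a 1x1 convolution: multiply each pixel's channel vector by weight."""
    return [[sum(w * v for w, v in zip(row, pixel)) for row in weight]
            for pixel in pixels]


def add_matrices(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


rng = random.Random(0)
channels = 8
# A 4x4 feature map flattened to 16 pixels, each with `channels` values.
pixels = [[rng.uniform(-1.0, 1.0) for _ in range(channels)] for _ in range(16)]

# Training-time view: parallel branches with different group counts (assumed 1, 2, 4).
branch_weights = [grouped_weight(rng, channels, g) for g in (1, 2, 4)]
train_out = reduce(add_matrices, [pointwise_conv(w, pixels) for w in branch_weights])

# Inference-time view: summing the branch weights yields a single convolution
# that computes the same output with one pass instead of three.
fused_weight = reduce(add_matrices, branch_weights)
infer_out = pointwise_conv(fused_weight, pixels)

max_abs_diff = max(abs(a - b)
                   for row_a, row_b in zip(train_out, infer_out)
                   for a, b in zip(row_a, row_b))
print(f"max abs difference between branch sum and fused conv: {max_abs_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why the extra training-time branches add no inference cost once their weights are merged.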