Abstract: The transformer model has recently gained widespread adoption in computer vision tasks. However, because the time and memory complexity of self-attention is quadratic in the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent work has designed CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency and low-frequency features of varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5 with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% (78.5% vs. 75.9%) in top-1 accuracy on the ImageNet dataset, with comparable inference latency. Our code can be found at https://github.com/tany0699/FMViT.
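To make the quadratic cost of self-attention concrete, the following minimal sketch counts the entries of the token-to-token attention score matrix for a single head. The patch size of 16 and the input resolutions 224 and 448 are illustrative assumptions, not FMViT's configuration, and the helper names `num_tokens` and `attention_entries` are hypothetical. Any class token is ignored.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_entries(tokens: int) -> int:
    """Entries in the tokens x tokens attention score matrix of one head."""
    return tokens * tokens


if __name__ == "__main__":
    # Assumed settings for illustration only: patch size 16, two resolutions.
    for size in (224, 448):
        n = num_tokens(size, 16)
        print(f"{size}x{size} image: {n} tokens, {attention_entries(n)} score entries")
```

Under these assumptions, doubling the image side length quadruples the token count (196 to 784) and multiplies the score matrix size by 16 (38,416 to 614,656 entries), which is the scaling behavior that makes plain self-attention costly at deployment time.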

Adaptive Masked Autoencoder Transformer for Image Classification

GhostViT: Expediting Vision Transformers Via Cheap Operations

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

Towards Efficient Adversarial Training on Vision Transformers

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

FDViT: Improve the Hierarchical Architecture of Vision Transformer

MViT: Mask Vision Transformer for Facial Expression Recognition in the Wild

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Improving Vision Transformers by Revisiting High-Frequency Components

Super Vision Transformer

Transformer with token attention and attribute prediction for image captioning

MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation

DctViT: Discrete Cosine Transform Meet Vision Transformers

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Masked autoencoders are effective solution to transformer data-hungry

CageViT: Convolutional Activation Guided Efficient Vision Transformer

EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Edge Devices

FMViT: A multiple-frequency mixing Vision Transformer

Adaptive Token Sampling For Efficient Vision Transformers