Abstract: Transformer models have recently been widely adopted in computer vision tasks. However, because the time and memory cost of self-attention grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent work has designed CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture named FMViT. It enhances the model's expressive power by blending high-frequency and low-frequency features at varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5, but with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% (78.5% vs. 75.9%) in top-1 accuracy on the ImageNet dataset, with comparable inference latency. Our code can be found at https://github.com/tany0699/FMViT.
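
The quadratic cost of self-attention noted in the abstract can be made concrete with a rough operation count. The sketch below is illustrative only and is not taken from the FMViT code: the helper names, the 16-pixel patch size, and the head dimension of 64 are assumptions, and the count ignores the linear projections and the softmax.

```python
def tokens_for_image(side: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for a square image (assumed tokenization)."""
    return (side // patch) ** 2


def attention_cost(num_tokens: int, dim: int) -> dict:
    """Rough per-head cost of plain self-attention, ignoring projections and softmax."""
    # Q K^T yields a num_tokens x num_tokens score matrix that must be stored.
    score_entries = num_tokens * num_tokens
    # Q K^T and the weighted sum over V each need about N * N * d multiply-adds.
    mult_adds = 2 * num_tokens * num_tokens * dim
    return {"score_entries": score_entries, "mult_adds": mult_adds}


if __name__ == "__main__":
    # Assumed settings: 16x16 patches, head dimension 64.
    for side in (224, 448):
        n = tokens_for_image(side, patch=16)
        cost = attention_cost(n, dim=64)
        print(side, n, cost["score_entries"], cost["mult_adds"])
```

Under these assumptions, doubling the image side from 224 to 448 quadruples the token count (196 to 784) and multiplies both the score-matrix size and the multiply-add count by 16, which is the scaling that makes plain ViTs costly on deployment platforms.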

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

TransFG: A Transformer Architecture for Fine-Grained Recognition

RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Multi-level information fusion Transformer with background filter for fine-grained image recognition

Attention-based Multi-scale ViT Fine-grained Visual Classification

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Fusion of regional and sparse attention in Vision Transformers

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

Constituent Attention for Vision Transformers

Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

FMViT: A multiple-frequency mixing Vision Transformer

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition