Abstract: Transformer models have recently been widely adopted in computer vision tasks. However, because the time and memory complexity of self-attention grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although recent work has explored CNN-Transformer hybrid architectures to address this problem, their overall performance has not met expectations. To tackle these challenges, we propose an efficient hybrid ViT architecture named FMViT. It enhances the model's expressive power by blending high-frequency and low-frequency features of varying frequencies, enabling it to capture both local and global information effectively. Additionally, we introduce deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization (gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional Fusion Block (CFB) to further improve the model's performance and reduce computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in terms of latency/accuracy trade-offs for various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% (83.3% vs. 80.8%) in top-1 accuracy on the ImageNet dataset while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5 while improving inference speed by 43%. On CoreML, FMViT outperforms MobileOne by 2.6% (78.5% vs. 75.9%) in top-1 accuracy on the ImageNet dataset at comparable inference latency. Our code is available at https://github.com/tany0699/FMViT.

A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer

TransVOS: Video Object Segmentation with Transformers

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective

Joint Structured Sparsity Regularized Multiview Dimension Reduction for Video-Based Facial Expression Recognition

Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection

FMViT: A multiple-frequency mixing Vision Transformer

Fine-Grained Temporal-Enhanced Transformer for Dynamic Facial Expression Recognition

Exploring Temporal Coherence for More General Video Face Forgery Detection

Multi-target video-based face recognition and gesture recognition based on enhanced detection and multi-trajectory incremental learning

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

Visformer: The Vision-friendly Transformer

MViT: Mask Vision Transformer for Facial Expression Recognition in the Wild

VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots