Abstract: Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform exceptionally well in video action recognition, but they still suffer from high computational cost and large memory consumption. Current research therefore focuses on finding effective methods to improve model performance while overcoming these limits. We present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed for efficient action recognition in videos with reduced computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and we introduce multi-granularity on top of multi-scale, which allows selecting how many tokens enter the next computational step and thereby reduces redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. Together, these two mechanisms optimize the allocation of resources in the model, emphasizing critical information and reducing redundancy, thereby minimizing computational cost. To assess the proposed approach, we conduct comprehensive experiments on benchmark datasets in the action recognition domain. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
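
The token-reduction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_tokens`, the use of the L2 norm as an informativeness score, and the score-weighted average used to merge low-scoring tokens are all assumptions made for illustration. The sketch keeps the `keep` highest-scoring tokens at fine granularity and fuses the remaining ones into a single coarse token, so the sequence passed to the next step is shorter.

```python
import math


def fuse_tokens(tokens, keep):
    """Keep the `keep` highest-scoring tokens and merge the rest into one coarse token.

    Hypothetical helper: the scoring rule and the fusion rule are illustrative
    assumptions, not the method described in the paper.
    """
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Assumed informativeness score: L2 norm of each token vector.
    scores = [math.sqrt(sum(x * x for x in t)) for t in tokens]

    # Rank token indices from most to least informative.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original sequence order
    rest_idx = order[keep:]

    fine = [list(tokens[i]) for i in kept_idx]
    if not rest_idx:
        return fine  # nothing to fuse

    # Fuse the low-scoring tokens with a score-weighted average
    # (uniform weights if all remaining scores are zero).
    total = sum(scores[i] for i in rest_idx)
    if total == 0:
        weights = [1.0 / len(rest_idx)] * len(rest_idx)
    else:
        weights = [scores[i] / total for i in rest_idx]

    dim = len(tokens[0])
    coarse = [
        sum(w * tokens[i][d] for w, i in zip(weights, rest_idx))
        for d in range(dim)
    ]
    return fine + [coarse]


if __name__ == "__main__":
    tokens = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0], [6.0, 8.0]]
    reduced = fuse_tokens(tokens, keep=2)
    print(len(tokens), "->", len(reduced), "tokens")
```

Because self-attention cost grows quadratically with sequence length, even a modest reduction in token count lowers the cost of every subsequent block; in a real model the score would more plausibly come from attention weights or a learned predictor than from a vector norm.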

Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization

Cross-scale Vision Transformer for crowd localization

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Multiscale Vision Transformer With Deep Clustering-Guided Refinement for Weakly Supervised Object Localization

Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

A Multi-Modal Transformer Network for Action Detection

Multi-granularity transformer fusion for temporal action localization

Local-to-Global Self-Attention in Vision Transformers

STDM-transformer: Space-time dual multi-scale transformer network for skeleton-based action recognition

A unified framework for unsupervised action learning via global-to-local motion transformer

End-to-End Spatio-Temporal Action Localisation with Video Transformers

Localization and recognition of human action in 3D using transformers

Multi-manifold Attention for Vision Transformers

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

DVANet: Disentangling View and Action Features for Multi-View Action Recognition

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

An Effective-Efficient Approach for Dense Multi-Label Action Detection

Multi-Scale Adaptive Skeleton Transformer for action recognition

Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection

Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints