Abstract: Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform exceptionally well in video action recognition. However, issues such as high computational cost and large memory consumption remain and cannot be ignored, so current research focuses on finding effective ways to improve model performance and overcome these limitations. We therefore present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed to perform action recognition in videos efficiently by reducing computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and multi-granularity is introduced on top of multi-scale so that the number of tokens entering the next computational step can be chosen selectively, thereby reducing redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. Together, the selective token choice and the coarse-fine fusion optimize the allocation of resources in the model, emphasizing critical information and reducing redundancy, and thereby lowering computational cost. To assess the proposed approach, we conduct comprehensive experiments on benchmark datasets in the action recognition domain. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
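
The token-reduction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how token importance is measured or how low-information tokens are fused, so the function name `select_and_fuse`, the externally supplied `scores`, and the use of simple averaging over groups of `group` tokens are all assumptions made for illustration. The sketch keeps the highest-scoring tokens at fine granularity and merges the remaining ones into fewer coarse tokens.

```python
from typing import List

Vector = List[float]


def select_and_fuse(tokens: List[Vector], scores: List[float],
                    keep: int, group: int) -> List[Vector]:
    """Keep the `keep` highest-scoring tokens; average the rest in groups.

    Hypothetical helper: the scoring and the mean-based fusion are
    illustrative assumptions, not the mechanism used in the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0 or group < 1:
        raise ValueError("keep must be >= 0 and group must be >= 1")

    # Rank token indices by score, highest first (ties keep original order).
    ranked = sorted(range(len(tokens)), key=lambda i: -scores[i])
    fine_idx = sorted(ranked[:keep])    # informative tokens, original order
    coarse_idx = sorted(ranked[keep:])  # low-information tokens

    fine = [tokens[i] for i in fine_idx]

    # Fuse low-information tokens: mean of each run of `group` tokens.
    coarse = []
    for start in range(0, len(coarse_idx), group):
        chunk = [tokens[i] for i in coarse_idx[start:start + group]]
        dim = len(chunk[0])
        coarse.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])

    return fine + coarse


# Six 2-d tokens with made-up importance scores.
tokens = [[1.0, 0.0], [2.0, 2.0], [0.0, 4.0], [6.0, 2.0], [3.0, 3.0], [5.0, 1.0]]
scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.3]
reduced = select_and_fuse(tokens, scores, keep=2, group=2)
print(len(tokens), "->", len(reduced))  # prints "6 -> 4"
```

In this toy run the sequence shrinks from six tokens to four. Because standard self-attention compares every token with every other token, a shorter sequence means fewer pairwise interactions (36 versus 16 here), which is the kind of saving the abstract attributes to reducing redundant tokens.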

Action-Transformer for Action Recognition in Short Videos

Convolutional transformer network for fine-grained action recognition

Efficient Action Recognition with Introducing R(2+1)D Convolution to Improved Transformer

Sparse Dense Transformer Network for Video Action Recognition

Hierarchy Spatial-Temporal Transformer for Action Recognition in Short Videos

Sparse Transformer-Based Algorithm for Long-Short Temporal Association Action Recognition

Short-Term Action Recognition by 3D Convolutional Neural Network with Pixel-Wise Evidences

Transformer-Based Multiview Deep Feature Learning for Action Recognition in Depth Videos

A Multi-Modal Transformer Network for Action Detection

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Efficient Video Transformers via Spatial-Temporal Token Merging for Action Recognition

Human action recognition with transformer based on convolutional features

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

ActionFormer: Localizing Moments of Actions with Transformers

k-NN Attention-Based Video Vision Transformer for Action Recognition

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

TxVAD: Improved Video Action Detection by Transformers

Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

FSConformer: A Frequency-Spatial-Domain CNN-Transformer Two-Stream Network for Compressed Video Action Recognition

Relative-position Embedding Based Spatially and Temporally Decoupled Transformer for Action Recognition