Abstract: Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform well in video action recognition, but they still suffer from high computational cost and large memory consumption. Current research therefore focuses on methods that improve model performance while overcoming these limits. We present a novel Vision Transformer model based on multi-granularity and multi-scale fusion, designed for efficient action recognition in videos with reduced computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and multi-granularity is introduced on top of multi-scale so that the number of tokens entering the next computational step can be chosen selectively, which reduces redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. Together, token selection and coarse-fine fusion optimize the allocation of resources in the model, emphasizing critical information and reducing redundancy, thereby lowering computational cost. To assess the proposed approach, we conduct comprehensive experiments on benchmark datasets in the action recognition domain. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
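
The token-reduction idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names `reduce_tokens` and `mean_token`, the per-token importance `scores`, and the averaging-based fusion are all assumptions made for illustration, since the abstract does not specify how tokens are scored or fused. The sketch keeps the highest-scoring tokens at fine granularity and fuses the remaining, less informative tokens into fewer coarse tokens, which shortens the sequence passed to the next step.

```python
from typing import List, Sequence

Token = List[float]


def mean_token(tokens: Sequence[Token]) -> Token:
    """Average a non-empty group of equal-length token vectors."""
    n = len(tokens)
    return [sum(vals) / n for vals in zip(*tokens)]


def reduce_tokens(
    tokens: Sequence[Token],
    scores: Sequence[float],
    keep: int,
    group: int,
) -> List[Token]:
    """Hypothetical token reduction (an illustration, not the paper's layer).

    Keeps the `keep` highest-scoring tokens unchanged (fine granularity)
    and averages the remaining tokens in groups of `group` (coarse
    granularity). Original ordering is preserved within each part.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0 or group < 1:
        raise ValueError("keep must be >= 0 and group must be >= 1")

    # Rank token indices by assumed importance, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    fine_idx = sorted(order[:keep])
    rest_idx = sorted(order[keep:])

    fine = [tokens[i] for i in fine_idx]
    # Fuse low-scoring tokens: each group of `group` tokens becomes one.
    coarse = [
        mean_token([tokens[i] for i in rest_idx[start:start + group]])
        for start in range(0, len(rest_idx), group)
    ]
    return fine + coarse


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0], [0.0, 4.0], [6.0, 6.0]]
scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05]
reduced = reduce_tokens(tokens, scores, keep=2, group=2)
print(len(tokens), "->", len(reduced))  # 6 -> 4
```

In this toy run, two tokens are kept as they are and the other four are fused pairwise, so six tokens become four. Since self-attention cost grows quadratically with sequence length, even a modest reduction of this kind lowers compute and memory, which is the motivation the abstract gives.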

Multi-dimensional convolution transformer for group activity recognition

Learning Visual Context for Group Activity Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Detector-Free Weakly Supervised Group Activity Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

A Multi-Stream Convolutional Neural Network Framework for Group Activity Recognition

Dynamical Attention Hypergraph Convolutional Network for Group Activity Recognition

Convolutional transformer network for fine-grained action recognition

A human activity recognition method based on Vision Transformer

LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition

Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Transformer With Bidirectional GRU for Nonintrusive, Sensor-Based Activity Recognition in a Multiresident Environment

Multi-scale residual network model combined with Global Average Pooling for action recognition

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

Modeling transformer architecture with attention layer for human activity recognition

Multi-scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Multi-scale Context-aware Network with Transformer for Gait Recognition