Abstract: Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform very well in video action recognition. However, issues such as high computational cost and large memory consumption remain, and current research focuses on methods that improve model performance while overcoming these limits. We therefore present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed for action recognition in videos, that reduces computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and we introduce multi-granularity on top of multi-scale, which allows the model to select how many tokens enter the next computational step, thereby reducing redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. Together, the token selection and fusion mechanisms optimize the allocation of resources in the model, emphasizing critical information and reducing redundancy, and thereby lowering computational cost. To assess the proposed approach, we conduct comprehensive experiments on benchmark datasets for action recognition. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
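
The token-reduction idea in the abstract (pass only a selected number of tokens to the next step, and shorten the sequence of low-information tokens by fusing them) can be illustrated with a minimal sketch. This is not the MgMViT implementation: the abstract does not specify how tokens are scored or fused, so the function name `reduce_tokens`, the per-token `scores` input, and the plain-mean fusion rule below are all assumptions made for illustration.

```python
from typing import List, Sequence

Token = List[float]


def reduce_tokens(tokens: Sequence[Token],
                  scores: Sequence[float],
                  keep: int) -> List[Token]:
    """Keep the `keep` highest-scoring tokens; fuse the rest into one coarse token.

    Hypothetical helper: the scoring and the mean-based fusion are
    illustrative assumptions, not the rule used in the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])  # restore original sequence order
    rest_idx = ranked[keep:]          # low-score tokens to be fused

    out = [list(tokens[i]) for i in kept_idx]
    if rest_idx:
        dim = len(tokens[0])
        # Element-wise mean of the low-score tokens: one coarse token
        # replaces all of them, shortening the sequence.
        coarse = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx)
                  for d in range(dim)]
        out.append(coarse)
    return out


# Four 2-dimensional tokens with made-up importance scores.
tokens = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 6.0]]
scores = [0.9, 0.1, 0.7, 0.2]
reduced = reduce_tokens(tokens, scores, keep=2)
print(len(tokens), "->", len(reduced))  # sequence length before and after
```

In this sketch the two highest-scoring tokens pass through unchanged and the two remaining tokens collapse into a single averaged token, so later attention layers would operate on three tokens instead of four. In a real model the scores would typically come from the network itself (for example, attention weights), which is again an assumption rather than something the abstract states.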

Vision transformer with multiple granularities for person re-identification

Person Re-identification Based on Transform Algorithm

RETRACTED CHAPTER: Person Re-identification Based on Transform Algorithm

Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification

Transformer-based Feature Interactor for Person Re-Identification with Margin Self-Punishment Loss

Point-level feature learning based on vision transformer for occluded person re-identification

Video-based person re-identification with complementary local and global features using a graph transformer

A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification

MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Skip Connection Aggregation Transformer for Occluded Person Reidentification

Person Retrieval with Conv-Transformer.

Part-Aware Transformer for Generalizable Person Re-identification

Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

Multi-modal person re-identification based on transformer relational regularization

Multi-Scale Transformer-Based Matching Network for Generalizable Person Re-Identification

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification

Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Learning Discriminative Features with Multiple Granularities for Person Re-Identification

Learning transformer-based attention region with multiple scales for occluded person re-identification