Abstract: Compressed video action recognition reduces decoding and inference time compared with recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different types of frames (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but their redundant information may interfere with the recognition task. P-frames contain two modalities, residuals (R) and motion vectors (MV); although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frames stream captures temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability across modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frames stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents action features at both fine-grained and coarse-grained levels. We evaluate DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.

Convolutional Transformer with Similarity-based Boundary Prediction for Action Segmentation

Dilated Transformer with Feature Aggregation Module for Action Segmentation

Do We Really Need Temporal Convolutions in Action Segmentation?

TransVOS: Video Object Segmentation with Transformers

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Temporal Segment Transformer for Action Segmentation

ConvTransformer Attention Network for temporal action detection

ASFormer: Transformer for Action Segmentation

Efficient Action Recognition with Introducing R(2+1)D Convolution to Improved Transformer

Pyramid Dilated Attention Network for Action Segmentation

Convolutional transformer network for fine-grained action recognition

Local–Global Transformer Neural Network for Temporal Action Segmentation

LASFormer: Light Transformer for Action Segmentation with Receptive Field-Guided Distillation and Action Relation Encoding

Prototypical Transformer for Weakly Supervised Action Segmentation

Enhancing Transformer Backbone for Egocentric Video Action Segmentation

Fast and Unsupervised Action Boundary Detection for Action Segmentation

Boundary-sensitive Denoised Temporal Reasoning Network for Video Action Segmentation

LGAFormer: transformer with local and global attention for action detection

ActionFormer: Localizing Moments of Actions with Transformers

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Compressed Video Action Recognition with Dual-Stream and Dual-Modal Transformer