Abstract: Compressed video action recognition reduces decoding and inference time compared with recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different frame types (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in information, but their redundant content may interfere with the recognition task. P-frames carry two modalities, residuals (R) and motion vectors (MV); although they contain less information, they reflect motion cues. To address these challenges and exploit the independent information in different frames and modalities, we propose a novel approach called the Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frame stream captures temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability between the modalities in P-frames and to complement the orthogonal feature vectors. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frame stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context with local key-frame features, our model represents actions at both fine-grained and coarse-grained levels. We evaluate DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
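
The two-stream design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the internals of DAM or FPE, so plain scaled dot-product cross-attention stands in for the dual-modal step, mean pooling stands in for the transformer encoders, and concatenation stands in for the fusion. All function names, token counts, and dimensions below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Scaled dot-product attention: every query token attends over all context tokens.
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)      # shape (n_query, n_context)
    return softmax(scores, axis=-1) @ context    # shape (n_query, d)

def dual_modal_fusion(residual_tokens, mv_tokens):
    # Assumed stand-in for a dual-modal attention step: each P-frame modality
    # is enriched with the other one, with a skip connection.
    r = residual_tokens + cross_attention(residual_tokens, mv_tokens)
    m = mv_tokens + cross_attention(mv_tokens, residual_tokens)
    return np.concatenate([r, m], axis=0)

def dual_stream_feature(iframe_tokens, residual_tokens, mv_tokens):
    # Local (P-frame) feature: pooled output of the dual-modal step.
    p_feat = dual_modal_fusion(residual_tokens, mv_tokens).mean(axis=0)
    # Global (I-frame) feature: pooled I-frame tokens (assumed pooling).
    i_feat = iframe_tokens.mean(axis=0)
    # Fuse global context and local motion cues by concatenation (assumed).
    return np.concatenate([i_feat, p_feat])

rng = np.random.default_rng(0)
d = 16                                            # hypothetical embedding size
iframe_tokens = rng.standard_normal((8, d))       # hypothetical I-frame tokens
residual_tokens = rng.standard_normal((4, d))     # hypothetical residual tokens
mv_tokens = rng.standard_normal((4, d))           # hypothetical motion-vector tokens

video_feature = dual_stream_feature(iframe_tokens, residual_tokens, mv_tokens)
print(video_feature.shape)
```

In a real model each stream would be a learned transformer encoder and the fused feature would feed a classification head; the sketch only shows how the two P-frame modalities can exchange information before being combined with the I-frame context.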

Compressed Video Action Recognition with Dual-Stream and Dual-Modal Transformer