Abstract: Compressed video action recognition reduces decoding and inference time compared with recognition in the RGB domain. However, the compressed domain poses unique challenges because it contains different types of frames (I-frames and P-frames). I-frames, which are consistent with RGB frames, are rich in frame information, but their redundant information may interfere with the recognition task. P-frames contain two modalities, residual (R) and motion vector (MV); although they carry less information, they reflect motion cues. To address these challenges and leverage the independent information from different frames and modalities, we propose a novel approach called Dual-Stream and Dual-Modal Transformer (DSDMT). Our approach consists of two streams: 1) The short-span P-frame stream contains temporal information. We propose the Dual-Modal Attention Module (DAM) to mine the variability between the modalities in P-frames and complement the orthogonal feature vector. In addition, considering the sparsity of P-frames, we extract action features with Frame-level Patch Embedding (FPE) to avoid redundant computation. 2) The long-span I-frame stream extracts the global context feature of the entire video, including content and scene information. By fusing the global video context and local key-frame features, our model represents the action feature at both fine-grained and coarse-grained levels. We evaluate DSDMT on three public benchmarks of different scales: HMDB-51, UCF-101, and Kinetics-400. Our method achieves better performance with fewer FLOPs and lower latency. Our analysis shows that the independence and complementarity of the I-frames and P-frames extracted from the compressed video stream play a crucial role in action recognition.
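The split of a compressed video into a long-span I-frame stream and a short-span P-frame stream can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_streams`, the one-character-per-frame encoding, and the `p_per_gop` parameter are all hypothetical and serve only to show how frame indices might be routed to the two streams.

```python
from typing import Dict, List


def split_streams(frame_types: str, p_per_gop: int = 3) -> Dict[str, List[int]]:
    """Partition frame indices into an I-frame stream and a P-frame stream.

    frame_types: one character per frame, 'I' or 'P' (hypothetical encoding).
    p_per_gop:   how many P-frames to keep after each I-frame (assumed knob).
    """
    i_stream: List[int] = []  # long span: one entry per group of pictures (GOP)
    p_stream: List[int] = []  # short span: a few P-frames following each I-frame
    kept_in_gop = 0
    for idx, kind in enumerate(frame_types):
        if kind == "I":
            i_stream.append(idx)
            kept_in_gop = 0  # a new GOP starts at every I-frame
        elif kind == "P":
            # P-frames before the first I-frame have no reference and are skipped.
            if i_stream and kept_in_gop < p_per_gop:
                p_stream.append(idx)
                kept_in_gop += 1
        else:
            raise ValueError(f"unknown frame type {kind!r} at index {idx}")
    return {"I": i_stream, "P": p_stream}


streams = split_streams("IPPPPIPPPPIPP", p_per_gop=2)
print(streams)
```

In a real pipeline, each index in the P-frame stream would map to two inputs, the residual and the motion vector, while each index in the I-frame stream would map to a decoded RGB frame.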

Compressed Video Action Recognition with Dual-Stream and Dual-Modal Transformer
