Abstract: Compressed video action recognition classifies actions directly from the multiple features stored in compressed videos, skipping the decoding of RGB frames and shortening computation time. Most previous methods process the compressed video features with multiple networks and reduce computational complexity further by adopting lightweight networks that do not sacrifice accuracy. We focus on a different approach: using only one network to reduce computational complexity. Our previous study proposed the MussNet model, which consists of independent subnetworks within a single network rather than multiple networks. The subnetworks classify the compressed video features independently in a single feedforward pass of that network, and the model achieved accuracy competitive with previous studies at lower computational complexity. The remaining issue with the MussNet model is how to fuse the independently processed compressed video features. The current MussNet model makes an independent prediction from each input and fuses the inputs only by averaging those predictions. However, recent studies have shown that intermediate fusion, which fuses features inside the network, improves accuracy. This study proposes the EFS module, which extends the MussNet model to intermediate fusion by disentangling and aggregating the features of the same video in the hidden vectors while keeping the individual subnetworks. Our experiments show that the EFS module improves the MussNet model's accuracy by 0.4 points on UCF-101 and 1.0 points on HMDB-51, while the additional GFLOPs amount to only 1% of those of the MussNet model. These accuracy scores are also competitive with previous studies while retaining one of the lowest computational complexities.
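The difference between prediction-level averaging and intermediate fusion can be illustrated with a minimal NumPy sketch. This is not the MussNet or EFS implementation: the three streams, the tensor shapes, the shared linear classifier, and the function names `late_fusion` and `intermediate_fusion` are all assumptions made for illustration, and the feature averaging shown here is only a generic stand-in for the disentangling and aggregation performed by the EFS module.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def late_fusion(stream_logits):
    """Prediction-level fusion: average the per-stream class probabilities."""
    probs = np.stack([softmax(logits) for logits in stream_logits], axis=0)
    return probs.mean(axis=0)


def intermediate_fusion(stream_features, classifier_weights):
    """Generic feature-level fusion: average hidden features, classify once.

    Illustrative stand-in only; the EFS module itself is not specified here.
    """
    fused = np.stack(stream_features, axis=0).mean(axis=0)
    return softmax(fused @ classifier_weights)


rng = np.random.default_rng(0)
num_videos, hidden_dim, num_classes = 4, 8, 5  # hypothetical sizes

# Hypothetical hidden features for three compressed-video inputs.
features = [rng.normal(size=(num_videos, hidden_dim)) for _ in range(3)]
weights = rng.normal(size=(hidden_dim, num_classes))  # assumed shared classifier

late = late_fusion([f @ weights for f in features])
mid = intermediate_fusion(features, weights)
print(late.shape, mid.shape)
```

In the first function each input is classified on its own and only the resulting probabilities are combined, whereas in the second the inputs interact before the classifier, which is the property the intermediate-fusion literature relies on.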
