Abstract: Compressed video action recognition classifies actions directly from the multiple features stored in compressed videos, which avoids decoding RGB frames and shortens computation time. Most previous methods process these features with multiple networks and reduce computational complexity further by adopting lightweight networks that preserve accuracy. We take a different approach and use a single network. Our previous study proposed the MussNet model, which contains independent subnetworks within a single network instead of using multiple networks. The subnetworks classify the compressed video features independently in a single feedforward pass, and the model achieved accuracy competitive with previous studies at lower computational complexity. The remaining issue is how to fuse the independently processed features. The current MussNet model makes an independent prediction from each input and fuses them only by averaging the predictions. However, recent studies have shown that intermediate fusion, which fuses features inside the network, improves accuracy. This study proposes the EFS module, which extends the MussNet model to intermediate fusion by disentangling and aggregating the features of the same video in the hidden vectors while keeping the individual subnetworks. Our experiments show that the EFS module improves the MussNet model's accuracy by 0.4 points on UCF-101 and 1.0 points on HMDB-51, while adding GFLOPs equal to only 1% of the MussNet model's. These accuracy scores are also competitive with previous studies, and the model retains one of the lowest computational complexities.
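
The contrast between averaging predictions and fusing inside the network can be illustrated with a minimal sketch. It is not the MussNet or EFS implementation: the function names, the number of streams, and the toy linear classifier are all assumptions made for illustration, and the intermediate-fusion function shows only the generic idea of combining hidden vectors before a single classification step.

```python
import math


def softmax(logits):
    """Convert a list of logits into class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(per_stream_logits):
    """Average the class probabilities predicted independently per stream.

    This mirrors the averaging described in the abstract; the stream
    count and logit values passed in are hypothetical.
    """
    probs = [softmax(logits) for logits in per_stream_logits]
    n_streams = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_streams for c in range(n_classes)]


def intermediate_fusion(per_stream_hidden, classifier_weights):
    """Generic illustration only (NOT the EFS module): average the hidden
    vectors of all streams, then classify the fused vector once with a
    hypothetical linear layer given as a list of weight rows."""
    n_streams = len(per_stream_hidden)
    dim = len(per_stream_hidden[0])
    fused = [sum(h[d] for h in per_stream_hidden) / n_streams for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, fused)) for row in classifier_weights]
    return softmax(logits)


# Three hypothetical streams voting over three classes.
fused_probs = late_fusion([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0]])
predicted_class = fused_probs.index(max(fused_probs))
```

In late fusion the streams interact only after each has produced its own prediction, whereas in intermediate fusion the classifier sees a representation that already combines information from all streams.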
