Spatio-Temporal Collaborative Module for Efficient Action Recognition

Yanbin Hao, Shuo Wang, Yi Tan, Xiangnan He, Zhenguang Liu, Meng Wang
DOI: https://doi.org/10.1109/tip.2022.3221292
IF: 10.6
2022-01-01
IEEE Transactions on Image Processing
Abstract: Efficient action recognition aims to classify a video clip into a specific action category at low computational cost. It is challenging because integrated spatial-temporal calculation (e.g., 3D convolution) introduces intensive operations and increases complexity. This paper explores the feasibility of integrating channel splitting and filter decoupling for efficient architecture design and feature refinement by proposing a novel spatio-temporal collaborative (STC) module. STC splits the video feature channels into two groups and learns spatio-temporal representations for each group in parallel with decoupled convolutional operators. In particular, STC consists of two computation-efficient blocks, $\mathrm{S}_{\mathrm{T}}$ and $\mathrm{T}_{\mathrm{S}}$, which extract either spatial ($\mathrm{S}_{\cdot}$) or temporal ($\mathrm{T}_{\cdot}$) features and then refine those features globally with either temporal ($\cdot_{\mathrm{T}}$) or spatial ($\cdot_{\mathrm{S}}$) contexts. The spatial context is the information dynamics aggregated along the temporal axis, and the temporal context is that aggregated along the spatial axes. To thoroughly examine the method's performance in video action recognition, we conduct extensive experiments on five video benchmark datasets that require temporal reasoning. Experimental results show that the proposed STC networks achieve a competitive trade-off between model efficiency and effectiveness.
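To make the efficiency argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It counts multiply-accumulate operations (MACs) for one integrated 3x3x3 convolution and compares that with a split-and-decoupled layout in which half of the channels pass through a 1x3x3 spatial convolution and the other half through a 3x1x1 temporal convolution. The kernel sizes, the even channel split, the tensor shape, and all function names are assumptions chosen for illustration; the abstract does not specify them, and the cost of the global context refinement is not counted.

```python
def conv_macs(c_in, c_out, kernel, out_positions):
    """MACs for a dense convolution (no bias, no groups, same-size output)."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw * out_positions


def full_3d_macs(channels, t, h, w):
    # One integrated 3x3x3 convolution applied to all channels.
    return conv_macs(channels, channels, (3, 3, 3), t * h * w)


def split_decoupled_macs(channels, t, h, w):
    # Assumed layout: channels are split evenly into two groups.
    # One group gets a 1x3x3 spatial conv, the other a 3x1x1 temporal conv.
    half = channels // 2
    spatial = conv_macs(half, half, (1, 3, 3), t * h * w)
    temporal = conv_macs(half, half, (3, 1, 1), t * h * w)
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical feature-map shape: 64 channels, 8 frames, 56x56 pixels.
    c, t, h, w = 64, 8, 56, 56
    dense = full_3d_macs(c, t, h, w)
    split = split_decoupled_macs(c, t, h, w)
    print(f"3D conv MACs:         {dense:,}")
    print(f"split+decoupled MACs: {split:,}")
    print(f"reduction factor:     {dense / split:.1f}x")
```

Under these assumptions the convolution cost drops by a fixed factor (27 kernel taps over all channel pairs versus 9 + 3 taps over a quarter of the channel pairs each), which illustrates why combining channel splitting with filter decoupling is attractive before any refinement cost is added back.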