Efficient spatio-temporal network for action recognition
Su, Yanxiong
DOI: https://doi.org/10.1007/s11554-024-01541-6
IF: 2.293
2024-08-25
Journal of Real-Time Image Processing
Abstract: The input tensor of video data has temporal, spatial, and channel dimensions, all of which are crucial for extracting the complementary spatial, temporal, and spatio-temporal features used in video action recognition. To extract and integrate these features efficiently, we propose an efficient spatio-temporal module (ESTM) with three pathways dedicated to spatial, temporal, and spatio-temporal features, respectively. Each pathway uses a Cross Global Average Pooling (CGAP) module to compress one dimension, so that feature extraction focuses on the remaining two. This enhances feature extraction and improves recognition rates for complex actions. We also introduce a Motion Excitation Module (MEM) that enriches the input features by transforming correlations between adjacent frames while reducing computational complexity. Finally, ESTM and MEM are integrated into a 2D CNN to form the efficient spatio-temporal network (ESTN), with minimal impact on network parameters and computational cost. Extensive experiments show that ESTN outperforms state-of-the-art methods on datasets such as Something-Something V1 & V2 and HMDB51, validating its effectiveness.
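The two ideas the abstract describes, compressing one dimension by average pooling and relating adjacent frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not define CGAP or MEM in detail, so the function names `pool_out_axis` and `adjacent_frame_difference`, the (T, C, S) layout with flattened spatial positions, and the use of a plain frame difference as the adjacent-frame operation are all assumptions made for illustration.

```python
import numpy as np


def pool_out_axis(x, axis):
    """Average over one axis, keeping it with size 1.

    Hypothetical stand-in for compressing one dimension so that the
    result varies only along the remaining two.
    """
    return x.mean(axis=axis, keepdims=True)


def adjacent_frame_difference(x):
    """Difference between consecutive frames along axis 0.

    Assumed stand-in for a motion cue between adjacent frames; the
    last frame is zero-padded so the output keeps all T frames.
    """
    diff = x[1:] - x[:-1]
    pad = np.zeros_like(x[:1])
    return np.concatenate([diff, pad], axis=0)


# Hypothetical toy clip: T=4 frames, C=3 channels, S=5 spatial positions.
T, C, S = 4, 3, 5
x = np.arange(T * C * S, dtype=np.float64).reshape(T, C, S)

channel_spatial = pool_out_axis(x, axis=0)   # time compressed: (1, C, S)
temporal_spatial = pool_out_axis(x, axis=1)  # channel compressed: (T, 1, S)
temporal_channel = pool_out_axis(x, axis=2)  # space compressed: (T, C, 1)
motion = adjacent_frame_difference(x)        # same shape as x: (T, C, S)

print(channel_spatial.shape, temporal_spatial.shape, temporal_channel.shape)
print(motion.shape)
```

Keeping the pooled axis with size 1 lets each pooled map broadcast back against the full tensor, which is one common way such a map could be used to reweight features. How the paper actually combines the three pathways is not stated in the abstract.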
computer science, artificial intelligence; engineering, electrical & electronic; imaging science & photographic technology