F2D-SIFPNet: a Frequency 2D Slow-I-Fast-P Network for Faster Compressed Video Action Recognition

Yue Ming, Jiangwan Zhou, Xia Jia, Qingfang Zheng, Lu Xiong, Fan Feng, Nannan Hu
DOI: https://doi.org/10.1007/s10489-024-05408-y
IF: 5.3
2024-01-01
Applied Intelligence
Abstract: Recent video action recognition methods operate directly on RGB pixels in the compressed domain. This avoids the cumbersome decoding process of traditional methods and enables efficient recognition. However, these methods must convert the discrete cosine transform (DCT) frequency coefficients into an extended RGB pixel representation, which is time-consuming. To alleviate this drawback, a novel frequency 2D Slow-I-Fast-P network (F2D-SIFPNet) is proposed that significantly increases the speed of action recognition. First, a new Frequency-Domain Partial Decompression (FPDec) method is designed to extract the frequency-domain DCT coefficients directly from the compressed video, eliminating the final, time-consuming decoding step in FFmpeg. Second, a Frequency-Domain Channel Selection (FCS) strategy is introduced to down-sample the frequency-domain data, thereby increasing the saliency of the input. In addition, the Frequency Slow-I-Fast-P path (FSIFP) and the Adaptive Motion Excitation (AME) module are presented to emphasize the significant frequency components. FSIFP efficiently models slow spatial features and fast temporal changes simultaneously, while AME generates an adaptive convolution kernel that captures both long-term and short-term motion cues. Extensive experiments were conducted on four public datasets: Kinetics-700, Kinetics-400, UCF-101, and HMDB-51. The results showed superior accuracies of 55.6%, 74.0%, 96.3%, and 74.6%, respectively, with preprocessing that was 6.31 times faster.
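The general idea behind treating DCT coefficients as input channels and keeping only a subset of them can be illustrated with a minimal sketch. This is not the paper's FPDec or FCS: FPDec reads coefficients from the compressed bitstream, whereas the sketch below recomputes an 8x8 block DCT from pixels purely for illustration, and the "keep the k lowest frequencies" rule is an assumed stand-in for whatever selection criterion FCS actually uses. All function names are hypothetical.

```python
import numpy as np

BLOCK = 8  # block size assumed here, as used by JPEG/MPEG-style codecs


def dct_matrix(n=BLOCK):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index
    i = np.arange(n).reshape(1, -1)  # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)       # DC row has a different scale factor
    return m


def blockwise_dct(frame):
    """Map an (H, W) frame to (H/8, W/8, 64) frequency channels."""
    h, w = frame.shape
    assert h % BLOCK == 0 and w % BLOCK == 0, "frame must tile into 8x8 blocks"
    d = dct_matrix()
    # Split into non-overlapping 8x8 blocks: shape (H/8, W/8, 8, 8).
    blocks = frame.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK).transpose(0, 2, 1, 3)
    coeffs = d @ blocks @ d.T        # 2D DCT applied to every block
    return coeffs.reshape(h // BLOCK, w // BLOCK, BLOCK * BLOCK)


def low_frequency_channels(k):
    """Indices of the k lowest-frequency channels (assumed selection rule)."""
    u, v = np.divmod(np.arange(BLOCK * BLOCK), BLOCK)
    order = np.lexsort((u, u + v))   # primary key: u + v, tie-break: u
    return order[:k]


def select_channels(coeffs, k):
    """Keep only k frequency channels, shrinking the input along channels."""
    return coeffs[..., low_frequency_channels(k)]
```

Each 8x8 block becomes one spatial position with 64 frequency channels, so the spatial resolution drops by a factor of 8 in each dimension before any channel is discarded; selecting channels then reduces the input further.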