Frequency Enhancement Network for Efficient Compressed Video Action Recognition

Yue Ming, Lu Xiong, Xia Jia, Qingfang Zheng, Jiangwan Zhou, Fan Feng, Nannan Hu
DOI: https://doi.org/10.1109/icip49359.2023.10222848
2023-01-01
Abstract: Existing frequency-based action recognition methods achieve impressive gains in efficiency. However, they ignore low-frequency texture and edge clues, which degrades accuracy. To address this problem, we propose a novel frequency enhancement (FE) block for efficient compressed video action recognition, consisting of a temporal-channel two-heads attention (TCTHA) module and a frequency overlapping group convolution (FOGC) module. First, the TCTHA module uses attention to emphasize the inter-frame temporal context and the intra-frame informative frequency semantics. Then, the FOGC module groups channels into overlapping frequency bands to extract low-frequency texture and edge clues while maintaining interaction between groups. We integrate the FE block into 2D-CNNs that take frequency-domain I-frames as input, yielding FENet, which focuses on the pivotal low-frequency spatio-temporal semantics for action recognition. Experiments on HMDB-51, UCF-101, Kinetics-400, and Kinetics-700 verify that FENet achieves accuracy comparable to RGB-based methods while remaining highly efficient.
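The overlapping grouping that the abstract attributes to the FOGC module can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no group sizes or overlap widths, so the function name `overlapping_groups`, its parameters, and the assumption that channels are ordered from low to high frequency are all hypothetical. The sketch only shows how adjacent channel groups can share boundary channels, so that each group sees part of its neighbours' frequency band.

```python
def overlapping_groups(num_channels, num_groups, overlap):
    """Return half-open (start, end) channel ranges, one per group.

    Hypothetical helper: channels are assumed to be sorted from low to
    high frequency, and each group is widened by `overlap` channels on
    both sides (clipped at the ends) so that neighbouring groups share
    channels.
    """
    if num_groups <= 0 or num_channels % num_groups != 0:
        raise ValueError("num_channels must be a multiple of a positive num_groups")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")

    base = num_channels // num_groups  # width of each group before widening
    ranges = []
    for g in range(num_groups):
        start = max(0, g * base - overlap)
        end = min(num_channels, (g + 1) * base + overlap)
        ranges.append((start, end))
    return ranges


# Example: 12 frequency channels, 3 groups, 1 shared channel per side.
groups = overlapping_groups(num_channels=12, num_groups=3, overlap=1)
print(groups)  # [(0, 5), (3, 9), (7, 12)]
```

In a network, each range could select the input channels for a separate convolution. The shared channels are one plausible way to keep the groups interacting, which is the property the abstract highlights.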