IFF-Net: I-Frame Fusion Network for Compressed Video Action Recognition

Shaojie Li, Jinxin Guo, Jiaqiang Zhang, Xu Guo, Ming Ma
DOI: https://doi.org/10.1109/smc53992.2023.10394358
2023-01-01
Abstract: Compressed video action recognition has received significant attention due to its potential for reducing storage and computational costs. However, current methods typically capture only a few RGB frames and compressed motion cues (e.g., motion vectors and residuals), which are insufficient for modeling actions at their full temporal extent. To address this issue, we propose a Time Domain Fusion (TDF) module that extracts both low-frequency and high-frequency components from the video and fuses them, integrating abundant motion information into a single frame. Building on the TDF module, we introduce a new network called the I-Frame Fusion Network (IFF-Net). IFF-Net interacts with the original networks (I-frame, motion vector, and residual) in two ways: explicitly and implicitly. Explicit interaction extracts the new representation and the original compressed representations separately and then performs late fusion. In contrast, implicit interaction uses distillation, with IFF-Net acting as the teacher that guides the I-frame network to learn full temporal representations. Our approach outperforms state-of-the-art methods for compressed video action recognition on the UCF-101 and HMDB-51 datasets.
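The two interaction modes described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion weights or the distillation loss, so everything below is an assumption for illustration: explicit interaction is shown as a weighted sum of per-stream class scores, and implicit interaction as the common temperature-scaled KL divergence between teacher and student predictions. The function names (`late_fusion`, `distillation_loss`), the stream names, and the temperature value are hypothetical and not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of class scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Explicit interaction (assumed form): weighted sum of per-stream scores.

    stream_logits maps a stream name to its list of class scores.
    weights maps a stream name to a scalar; defaults to 1.0 for every stream.
    """
    names = list(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        for c, score in enumerate(stream_logits[name]):
            fused[c] += weights[name] * score
    return fused


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Implicit interaction (assumed form): KL(teacher || student) on softened
    distributions, scaled by temperature squared as in standard distillation."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical scores for a 3-class problem from four streams.
scores = {
    "iff_net": [2.0, 0.5, 0.1],
    "i_frame": [1.5, 0.7, 0.2],
    "motion_vector": [0.9, 1.1, 0.3],
    "residual": [1.2, 0.4, 0.6],
}
fused = late_fusion(scores)
predicted_class = fused.index(max(fused))

# The fused-frame network acts as teacher, the I-frame network as student.
loss = distillation_loss(scores["i_frame"], scores["iff_net"])
```

In this sketch the teacher scores would come from the network trained on fused frames and the student scores from the I-frame network; in practice the distillation term would be added to the student's usual classification loss during training.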