Temporal Feature Aggregation for Efficient 2D Video Grounding

Mohan Chen, Yiren Zhang, Jueqi Wei, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
DOI: https://doi.org/10.1109/icme57554.2024.10687387
2024-01-01
Abstract: Video grounding aims to locate the target video moment in an untrimmed video based on a text query. Most existing methods employ 3D CNNs as the video feature extractor, incurring substantial computational costs. Only a few methods use 2D backbones for video feature extraction, and they suffer from diminished accuracy due to the inherent lack of temporal information within 2D features. To address this problem, we propose a novel 2D video grounding method called TFA that improves accuracy while minimizing computational costs. Our approach involves a query-guided temporal feature aggregation module designed to explicitly capture temporal information. We disentangle time intervals of input video frames and prediction spans to reduce computational overhead. Additionally, we introduce deformable attention into the multi-modal encoder for further enhancement. Extensive experiments on two public datasets demonstrate that our method outperforms previous 2D video grounding methods and achieves competitive results with most 3D methods at significantly reduced costs.