Temporal-enhanced Cross-modality Fusion Network for Video Sentence Grounding.

Zezhong Lv, Bing Su
DOI: https://doi.org/10.1109/icme55011.2023.00257
2023-01-01
Abstract: Video sentence grounding aims to localize the video segment that semantically aligns with a given language query. Most existing works let the video and the query interact only once, at a single early stage. As a result, multi-level dependencies within videos go unexplored, because the interaction acts at one fixed level, and the guiding role of the query is neglected. To tackle these issues, we propose an efficient network, the Temporal-enhanced Cross-modality Fusion Network (TCFN). By directly modulating the temporal receptive field, TCFN effectively captures multi-level temporal-enhanced visual features. Furthermore, TCFN explicitly uses the query to interact with the temporal-enhanced features in multiple stages for better alignment. Benefiting from its succinct architecture, TCFN achieves competitive performance compared with state-of-the-art methods at a much lower computational cost. Experiments on three benchmark datasets verify the effectiveness of the proposed TCFN.