DCT-net: A Deep Co-Interactive Transformer Network for Video Temporal Grounding

Wen Wang, Jian Cheng, Siyu Liu
DOI: https://doi.org/10.1016/j.imavis.2021.104183
IF: 3.86
2021-01-01
Image and Vision Computing
Abstract: Language-guided video temporal grounding aims to temporally localize the video segment in a long, untrimmed video that best matches a given natural language query (sentence). The main challenge in this task is fusing visual and linguistic information effectively. Recent works have shown that the attention mechanism benefits multi-modal feature fusion. In this paper, we present a concise yet effective Deep Co-Interactive Transformer Network (DCT-Net), which repurposes a Transformer-style architecture to model cross-modal interactions thoroughly. It consists of Co-Interactive Transformer (CIT) layers cascaded in depth to perform multi-step interactions between a video-sentence pair. With the proposed CIT layer, visual and language features each improve by drawing on the other. Extensive experiments on two public datasets, ActivityNet Captions and TACoS, demonstrate the effectiveness of the proposed model compared with state-of-the-art methods.
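The idea of cascaded co-interactive layers can be illustrated with a minimal sketch. The abstract does not specify the internals of the CIT layer, so everything below is an assumption: the function names (`cross_attend`, `co_interactive_layer`, `dct_stack`) are hypothetical, and the layer is reduced to plain scaled dot-product cross-attention in both directions with a residual connection. Learned projections, multiple heads, normalization, and feed-forward sublayers, which a real Transformer-style model would likely include, are omitted.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    # Scaled dot-product attention: each query row attends over all rows
    # of keys_values (used as both keys and values; no learned projections).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def co_interactive_layer(video, sentence):
    # Hypothetical co-interactive step (an assumption, not the paper's layer):
    # each modality attends to the other, and both updates are computed
    # from the same layer inputs, then added residually.
    video_ctx = cross_attend(video, sentence)
    sentence_ctx = cross_attend(sentence, video)
    new_video = [[x + c for x, c in zip(row, ctx)]
                 for row, ctx in zip(video, video_ctx)]
    new_sentence = [[x + c for x, c in zip(row, ctx)]
                    for row, ctx in zip(sentence, sentence_ctx)]
    return new_video, new_sentence

def dct_stack(video, sentence, num_layers=3):
    # Cascade the layers in depth for multi-step interaction.
    for _ in range(num_layers):
        video, sentence = co_interactive_layer(video, sentence)
    return video, sentence

# Toy inputs: 3 clip features and 2 word features, both of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence = [[0.5, 0.5], [1.0, -1.0]]
fused_video, fused_sentence = dct_stack(video, sentence, num_layers=2)
```

In this sketch the sequence lengths of the two modalities are preserved, so the fused clip features could then feed a separate localization head that predicts segment boundaries; that head is not shown here.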