Memory-Augmented Transformer for Efficient End-to-End Video Grounding

Yuanwu Xu, Mohan Chen, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
DOI: https://doi.org/10.1109/icme57554.2024.10687697
2024-01-01
Abstract: Video grounding aims to localize the specific segment of an untrimmed video that corresponds to a text query. Because processing raw video frames is computationally expensive, the de facto paradigm is to extract video features with pretrained video encoders. The parameters of these encoders are kept fixed during training, which limits the performance of the localization model. To solve this problem, we propose a Memory-Augmented Transformer (MAT) model. Specifically, each video is split into non-overlapping clips, and our MAT processes the video clip by clip while caching video features in FIFO memory queues. By enabling early return, our MAT outperforms previous methods while seeing fewer than 60% of the frames. Extensive experimental results on three public benchmark datasets demonstrate that our MAT achieves competitive performance while being much more efficient than currently prevailing two-stage methods. Code is available at https://github.com/xuyw1997/MAT.
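The clip-by-clip processing with a FIFO memory queue and early return described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `split_into_clips`, `encode_clip`, `process_video`, and `score_fn`, the scalar "features", and the threshold-based stopping rule are all assumptions made for illustration, since the abstract does not specify the encoder or the early-return criterion.

```python
from collections import deque


def split_into_clips(frames, clip_len):
    """Split a frame sequence into non-overlapping clips (the last may be shorter)."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def encode_clip(clip):
    """Hypothetical stand-in for a video encoder: the mean of the frame values."""
    return sum(clip) / len(clip)


def process_video(frames, score_fn, clip_len=4, memory_size=3, threshold=0.9):
    """Process a video clip by clip, caching clip features in a FIFO queue.

    Stops early (assumed criterion) once score_fn(memory) reaches the threshold.
    Returns the cached features and the number of frames actually seen.
    """
    # deque with maxlen evicts the oldest entry when full, i.e. FIFO behaviour.
    memory = deque(maxlen=memory_size)
    frames_seen = 0
    for clip in split_into_clips(frames, clip_len):
        memory.append(encode_clip(clip))
        frames_seen += len(clip)
        if score_fn(list(memory)) >= threshold:
            break  # early return: the remaining clips are never encoded
    return list(memory), frames_seen


# Toy usage: frames are plain numbers; the score fires once a feature reaches 9.5.
frames = list(range(20))
memory, seen = process_video(
    frames, score_fn=lambda m: 1.0 if max(m) >= 9.5 else 0.0, memory_size=2
)
print(memory, seen)
```

In this toy run the loop stops after the third of five clips, so only 12 of the 20 frames are encoded, and the two-slot queue has already evicted the first clip's feature.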