Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Yulai Xie,Jingjing Niu,Yang Zhang,Fang Ren
DOI: https://doi.org/10.1109/tmm.2023.3307972
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract:Dense video captioning aims to detect all events of an uncropped video and generate corresponding textual captions for each event. Multi-modal information is essential to improve the performance of this task, but the existing methods mainly rely on the single visual or dual audio-visual modal input, while completely ignoring the text modal input (subtitle). Since the text data has a similar data representation as video caption words, it is conducive to the performance improvement of video captioning. In this paper, we propose a novel framework, called the multi-stage fusion transformer network (MS-FTN), to realize multi-modal dense video captioning by fusing the text, the audio, and the visual features in stages. We present a multi-stage feature fusion encoder that first fuses audio and visual modalities at a lower level and then fuses them with a global-shared text representation at a higher level to generate a set of multi-modal complementary context features. In addition, an anchor-free event proposal module is proposed to efficiently generate a set of event proposals without the complex anchor calculation. Extensive experiments on the subsets of the ActivityNet Captions dataset show that our proposed MS-FTN achieves superior performance and efficient computation. Moreover, the ablation studies demonstrate that the global-shared text representation is more suitable for multi-modal dense video captioning. Our code and data are available at https://github.com/xieyulai/GS-MS-FTN .
What problem does this paper attempt to address?