Tri-Modal Dense Video Captioning Based on Fine-Grained Aligned Text and Anchor-Free Event Proposals Generator
Jingjing Niu, Yulai Xie, Yang Zhang, Jinyu Zhang, Yanfei Zhang, Xiao Lei, Fang Ren
DOI: https://doi.org/10.1142/s021800142255014x
IF: 1.261
2022-01-01
International Journal of Pattern Recognition and Artificial Intelligence
Abstract: Multi-modal dense video captioning is the task of using multiple sources of information to detect all meaningful events in a video and generate a textual description for each event. Existing works mainly rely on a single visual modality or on dual audio-visual modalities, and ignore the text modality (subtitles). The text modality has a data structure similar to that of video captions, and so provides direct semantic information for describing the content of a video. In this paper, we propose a novel framework, called Two-Stage Cross-Modal Encoding Transformer Network (TS-CMETN), that performs multi-modal dense video captioning by fusing audio, visual, and text features. First, we design a two-stage feature fusion encoder that hierarchically models intra- and inter-modal information interaction. Second, we propose an anchor-free temporal event proposal module, which efficiently generates event proposals at each time step without complex anchor calculations. Extensive experiments on the ActivityNet Captions dataset show that our proposed framework achieves high performance. Moreover, our approach can adaptively handle cases in which the text modality is missing. Our code and data are available at https://github.com/xieyulai/TM-CMETN.
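To make the anchor-free idea concrete, the following is a minimal sketch of how per-time-step predictions could be decoded into event proposals without predefined anchors. It is not the paper's implementation: the abstract does not specify the prediction format, so the sketch assumes that each time step outputs a confidence score plus distances (in seconds) to the event start and end. The function name `decode_anchor_free_proposals` and all of its parameters are hypothetical.

```python
from typing import List, Tuple


def decode_anchor_free_proposals(
    scores: List[float],
    start_offsets: List[float],
    end_offsets: List[float],
    step_seconds: float,
    video_duration: float,
    score_threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Turn per-time-step predictions into (start, end, score) proposals.

    Assumed (hypothetical) prediction format: for time step t, scores[t] is
    the event confidence, and start_offsets[t] / end_offsets[t] are the
    predicted distances in seconds from the step's center to the event
    boundaries. No anchor lengths are enumerated or matched.
    """
    proposals = []
    for t, (score, d_start, d_end) in enumerate(
        zip(scores, start_offsets, end_offsets)
    ):
        if score < score_threshold:
            continue  # drop low-confidence time steps
        center = (t + 0.5) * step_seconds  # temporal center of step t
        start = max(0.0, center - d_start)  # clip to the video start
        end = min(video_duration, center + d_end)  # clip to the video end
        if end > start:
            proposals.append((start, end, score))
    # Highest-confidence proposals first.
    proposals.sort(key=lambda p: p[2], reverse=True)
    return proposals


example = decode_anchor_free_proposals(
    scores=[0.9, 0.2, 0.7],
    start_offsets=[1.0, 0.5, 3.0],
    end_offsets=[2.0, 0.5, 1.0],
    step_seconds=2.0,
    video_duration=6.0,
)
```

Each retained time step yields exactly one proposal in this sketch, so the cost grows linearly with the number of time steps rather than with the number of time steps multiplied by the number of anchor sizes.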