Semantic-Driven Saliency-Context Separation for Video Captioning

Heming Jing, Yuejie Zhang, Rui Feng, Rui-Wei Zhao, Tao Zhang, Xuequan Lu, Shang Gao
DOI: https://doi.org/10.1109/icme52920.2022.9859690
2022-01-01
Abstract: Video captioning aims at generating a natural language description for a given video clip including not only salient scenarios but also contextual scenarios. The former reveal the highlight of a video and are usually the focus of most existing captioning methods. The latter, however, are not well explored and even ignored easily, though they may provide certain detailed and latent information that can help with a better understanding of the video. To effectively exploit the information contained in both, a novel video captioning network is proposed. It has two key modules: Cross-Modality Selection (CMS) and Saliency-Context Adaptive Decoder (SCAD). Specifically, CMS mainly focuses on utilizing the semantic information to distinguish saliency and context. Meanwhile, SCAD adaptively identifies both the saliency and context to generate more detailed and precise captions. Experiments on two benchmark datasets, i.e., MSVD and MSR-VTT, demonstrate the effectiveness of our model through the comparison with state-of-the-art methods.