Semantic-Guided Network with Contrastive Learning for Video Caption

Kaixuan Chen, Qianji Di, Yang Lu, Hanzi Wang
DOI: https://doi.org/10.1109/icassp48485.2024.10447433
2024-01-01
Abstract: Video captioning is a challenging task that aims to generate a natural-language sentence describing the content of a video. Many existing methods model visual features (2D/3D) extracted from videos to generate captions but neglect semantic guidance. Empirically, visual features contain fine-grained information such as color and shape, whereas caption generation depends more on semantic and syntactic cues, which cannot be learned adequately from the caption loss alone. To alleviate this problem, we propose a semantic-guided network based on contrastive learning (SNCL), which uses both visual and textual information to enrich the features with contextual guidance. Building on these enriched features, hierarchical reasoning modules perform a key phrase prediction task that gives the model specific semantic guidance. Experimental results on the MSVD and MSR-VTT datasets show that SNCL outperforms recent state-of-the-art methods.
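The abstract does not specify the contrastive objective SNCL uses. As a rough illustration of how video and text features are commonly aligned with contrastive learning, the sketch below implements a generic symmetric InfoNCE loss in NumPy. The function name `symmetric_info_nce`, the `temperature` value, and the assumption that row i of each matrix forms a matched video-caption pair are all hypothetical choices for this example, not details taken from the paper.

```python
import numpy as np


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative; not the paper's exact loss).

    video_emb, text_emb: arrays of shape (batch, dim). Row i of each array is
    assumed to be a matched video/caption pair; all other rows act as negatives.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (v @ t.T) / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matched pair).
        return -np.mean(np.diag(log_probs))

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 8))
    aligned_text = video + 0.01 * rng.normal(size=(4, 8))  # near-identical pairs
    shuffled_text = np.roll(aligned_text, shift=1, axis=0)  # deliberately mismatched

    print("aligned loss: ", symmetric_info_nce(video, aligned_text))
    print("shuffled loss:", symmetric_info_nce(video, shuffled_text))
```

Under these assumptions, matched pairs yield a lower loss than mismatched pairs, which is the property that pulls paired video and text features together in a shared space.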