Abstract: Video captioning, which aims to automatically generate captions for videos, has gained significant attention due to its wide range of applications in video surveillance and retrieval. However, most existing methods extract features through frame-level convolution, which ignores the semantic relationships between objects and therefore fails to encode video details. To address this problem, and inspired by how humans perceive and reason about the world, we propose a video captioning method based on semantic disambiguation through structured encoding. First, a conceptual semantic graph of the video is constructed by introducing a knowledge graph. Then, graph convolutional networks perform relational learning on the conceptual semantic graph to mine the semantic relationships between objects and form a detail encoding of the video. To resolve the semantic ambiguity that arises when multiple relationships exist between objects, we propose a method that uses video scene semantics to dynamically learn the most relevant relationships and construct semantic graphs based on semantic disambiguation. Finally, we propose a cross-domain guided relationship learning strategy to avoid the negative impact of relying only on captions for the cross-entropy loss. Experiments on three datasets (MSR-VTT, ActivityNet Captions, and Student Classroom Behavior) show that our method outperforms the compared methods. The results indicate that introducing a knowledge graph for common-sense reasoning about objects in videos allows the model to encode the semantic relationships between objects more deeply, capture video details, and improve captioning performance.
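
To make the relational-learning step more concrete, the following is a minimal sketch of a single graph convolution layer applied to a small object graph. It is not the authors' implementation: the standard symmetric-normalization GCN update is assumed, and the object names, edge list, feature sizes, and the function names `normalize_adjacency` and `gcn_layer` are all hypothetical choices made for illustration.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the usual GCN normalization.

    Assumes `adj` is a square, non-negative adjacency matrix.
    """
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops so each node keeps its own features
    deg = a_hat.sum(axis=1)              # node degrees (always >= 1 because of self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj, features, weight):
    """One graph convolution step: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalize_adjacency(adj) @ features @ weight, 0.0)


# Hypothetical object graph: 0 = person, 1 = book, 2 = desk.
# Edges (person-book, book-desk) stand in for relationships that a
# knowledge graph might supply; they are illustrative only.
adj = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))   # assumed 8-dimensional object features
weight = rng.normal(size=(8, 4))     # assumed projection to 4 dimensions

encoded = gcn_layer(adj, features, weight)
print(encoded.shape)
```

Each row of `encoded` mixes an object's own features with those of its neighbors in the graph, which is the sense in which a graph convolution can carry relationship information into the video encoding. The disambiguation and cross-domain guidance steps described in the abstract are not modeled here.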

Dense Video Captioning for Incomplete Videos

Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols

Dense Video Object Captioning from Disjoint Supervision

Streaming Dense Video Captioning

Weakly Supervised Dense Video Captioning

Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Video Captioning with Transferred Semantic Attributes

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Video Paragraph Captioning As a Text Summarization Task

Semantic-Driven Saliency-Context Separation for Video Captioning

Discriminative Latent Semantic Graph for Video Captioning

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

SnapCap: Efficient Snapshot Compressive Video Captioning

Structured Encoding Based on Semantic Disambiguation for Video Captioning

CLIP4Caption: CLIP for Video Caption

OSVidCap: A Framework for the Simultaneous Recognition and Description of Concurrent Actions in Videos in an Open-Set Scenario

Video Captioning Using Weak Annotation

Learning Video-Text Aligned Representations for Video Captioning

Weakly Supervised Dense Event Captioning in Videos

Live Video Captioning

Comprehensive Visual Grounding for Video Description