Abstract: The grounding accuracy of existing video captioners still falls short of expectations. Most existing methods perform grounded video captioning on sparse entity annotations. However, grounded captioning models rely on deliberate grounding annotations as supervision, which are relatively hard to obtain. Moreover, captioning accuracy often suffers from degraded object appearance in the annotated areas, such as motion blur and video defocus, and these models seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network that improves video captioning by using inexpensive pseudo annotations, avoiding the need to collect large amounts of manual annotations. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames. The action grounding dynamically associates verbs with their related subjects and the corresponding context, which preserves fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by soft grounding supervision. More importantly, the grounding objective is supervised by pseudo annotations automatically produced by a grounding annotation generation module, so our model can be easily applied to challenging datasets for which no grounding annotations are provided. We conduct extensive experiments on three benchmark datasets and demonstrate significant performance improvements of +2.4 CIDEr on MSR-VTT, +4.7 CIDEr on MSVD, and +5.1 CIDEr on ActivityNet-Entities over state-of-the-art methods.
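
To make the idea of soft grounding supervision concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the attention over candidate regions is a softmax distribution, that the pseudo annotation for a word is a vector of non-negative region scores, and that the two are compared with a cross-entropy loss. The abstract does not specify the loss form, and the names `softmax` and `soft_grounding_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_grounding_loss(attention_logits, pseudo_scores, eps=1e-12):
    """Cross-entropy between a soft pseudo-annotation target and attention.

    attention_logits: unnormalized attention scores, one per region.
    pseudo_scores: non-negative pseudo-annotation scores, one per region
                   (assumed format; the paper's generation module is not shown).
    """
    attention = softmax(attention_logits)
    total = sum(pseudo_scores)
    if total <= 0:
        # No pseudo annotation for this word: skip grounding supervision.
        return 0.0
    # Normalize pseudo scores into a soft target distribution.
    target = [s / total for s in pseudo_scores]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))


# Attention that agrees with the pseudo annotation is penalized less.
aligned = soft_grounding_loss([2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
misaligned = soft_grounding_loss([0.0, 2.0, 0.0], [1.0, 0.0, 0.0])
print(aligned < misaligned)
```

Under these assumptions, the same loss could be applied to both entity words and verbs, which is one way to read the "unified task" described in the abstract.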

Exploring Collaborative Caption Editing to Augment Video-Based Learning

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Edit As You Wish: Video Caption Editing with Multi-grained User Control

Multimodality-guided Visual-Caption Semantic Enhancement

Explicit Image Caption Editing

Adaptive Curriculum Learning for Video Captioning

MoS²: Mixture of Scale and Shift Experts for Text-Only Video Captioning

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Object Relational Graph with Teacher-Recommended Learning for Video Captioning

Discriminative Latent Semantic Graph for Video Captioning

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Seeing and Hearing Too: Audio Representation for Video Captioning

Collaborative Detection and Caption Network

Consensus-Guided Keyword Targeting for Video Captioning

Learning Comprehensive Visual Grounding for Video Captioning

Utilizing Text-based Augmentation to Enhance Video Captioning

Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models

Cap4Video++: Enhancing Video Understanding with Auxiliary Captions

Learning Video-Text Aligned Representations for Video Captioning

Exploring Group Video Captioning with Efficient Relational Approximation

Structured Encoding Based on Semantic Disambiguation for Video Captioning