Learning Comprehensive Visual Grounding for Video Captioning

Wenhui Jiang, Linxin Liu, Yuming Fang, Yibo Cheng, Yuxin Peng, Yang Liu
DOI: https://doi.org/10.1109/tcsvt.2024.3502621
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: The grounding accuracy of existing video captioners still falls short of expectations. Most existing methods perform grounded video captioning on sparse entity annotations. However, grounded captioning models rely on deliberately collected grounding annotations as supervision, which are relatively hard to obtain. Moreover, captioning accuracy often suffers from degraded object appearance in the annotated areas, caused for example by motion blur and video defocus, and these models seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network that improves video captioning by using inexpensive pseudo annotations, avoiding the need to collect large amounts of manual annotations. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames. The action grounding dynamically associates verbs with their related subjects and the corresponding context, which preserves fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by soft grounding supervision. More importantly, the grounding objective is supervised by pseudo annotations automatically produced by a grounding annotation generation module, so our model can be easily applied to challenging datasets for which no grounding annotations are provided. We conduct extensive experiments on three benchmark datasets and demonstrate significant performance improvements of +2.4 CIDEr on MSR-VTT, +4.7 CIDEr on MSVD, and +5.1 CIDEr on ActivityNet-Entities compared with state-of-the-art methods.
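The abstract does not give the form of the soft grounding supervision. As a rough illustration only, one common way to supervise an attention distribution with a soft target is a KL divergence between the normalized pseudo-annotation scores and the attention weights over candidate regions. The sketch below assumes that formulation; the function names `softmax` and `soft_grounding_loss`, the per-region score inputs, and the choice of KL divergence are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw attention logits into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_grounding_loss(attention_logits, pseudo_scores, eps=1e-8):
    """KL(target || attention) over candidate regions for one word.

    attention_logits: unnormalized attention scores, one per region.
    pseudo_scores: non-negative pseudo-annotation scores, one per region
        (assumed here to come from some automatic annotation step).
    Returns 0.0 when no region carries a pseudo annotation.
    """
    total = sum(pseudo_scores)
    if total == 0:
        return 0.0  # nothing to supervise for this word
    target = [p / total for p in pseudo_scores]
    attention = softmax(attention_logits)
    # Sum only over regions with non-zero target mass (0 * log 0 := 0).
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attention)
        if t > 0
    )
```

Under this assumption, the loss is near zero when attention already matches the soft target and grows as attention drifts toward regions with little pseudo-annotation mass, so it can be added to a captioning loss as a regularizer on the attention weights.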