Consensus-Guided Keyword Targeting for Video Captioning.

Puzhao Ji, Bang Yang, Tong Zhang, Yuexian Zou
DOI: https://doi.org/10.1007/978-3-031-18913-5_21
2022-01-01
Abstract: Mainstream video captioning models (VCMs) are trained with fully supervised learning, which relies heavily on large-scale, high-quality video-caption pairs. Unfortunately, an evaluation of the corpora of benchmark datasets shows that human-labeled annotations have many defects, such as variation in caption length and quality for a single video and word imbalance across captions. Such defects may significantly affect model training. In this study, we propose to reduce the adverse impact of these annotations and to encourage VCMs to learn high-quality captions and more informative words via a Consensus-Guided Keyword Targeting (CGKT) training strategy. Specifically, CGKT first re-weights each training caption using a consensus-based metric, CIDEr. Second, CGKT assigns larger weights to informative, less frequently used words based on their frequency. Extensive experiments on MSVD and MSR-VTT show that the proposed CGKT can easily work with three VCMs to achieve significant CIDEr improvements. Moreover, compared with the conventional cross-entropy objective, CGKT facilitates the generation of more comprehensive and better-quality captions.
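The two re-weighting steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the inverse-frequency form with exponent `alpha`, and the normalization of consensus scores to a mean of 1 are hypothetical and are not the authors' implementation. The consensus scores are taken as given inputs (for example, precomputed CIDEr values of each caption against the other captions of the same video).

```python
import math
from collections import Counter


def word_weights(captions, alpha=0.5):
    """Hypothetical inverse-frequency word weights.

    The most frequent word gets weight 1.0; rarer words get larger weights.
    `alpha` (an assumed hyperparameter) controls how strongly rarity is rewarded.
    """
    counts = Counter(word for caption in captions for word in caption.split())
    max_count = max(counts.values())
    return {word: (max_count / n) ** alpha for word, n in counts.items()}


def caption_weights(consensus_scores):
    """Scale per-caption consensus scores (e.g. CIDEr) so they average to 1.

    Falls back to uniform weights if all scores are zero.
    """
    mean = sum(consensus_scores) / len(consensus_scores)
    if mean == 0:
        return [1.0] * len(consensus_scores)
    return [score / mean for score in consensus_scores]


def weighted_nll(token_log_probs, tokens, caption_weight, w_weights):
    """Caption- and word-weighted negative log-likelihood for one caption.

    Each token's loss (-log p) is scaled by its word weight, averaged over
    the caption length, then scaled by the caption-level weight.
    Words missing from `w_weights` default to weight 1.0.
    """
    total = sum(
        w_weights.get(token, 1.0) * -log_prob
        for token, log_prob in zip(tokens, token_log_probs)
    )
    return caption_weight * total / len(tokens)


if __name__ == "__main__":
    captions = ["a man is cooking", "a man is slicing a tomato"]
    ww = word_weights(captions, alpha=1.0)
    cw = caption_weights([0.5, 1.5])  # assumed, precomputed consensus scores
    # Pretend the model assigned probability 0.5 to each of two target tokens.
    loss = weighted_nll([math.log(0.5)] * 2, ["a", "tomato"], cw[1], ww)
    print(f"weighted loss: {loss:.4f}")
```

With uniform caption weights and all word weights equal to 1.0, `weighted_nll` reduces to the usual length-averaged cross-entropy, which is the baseline objective the abstract compares against.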