Boosting Semi-Supervised Video Captioning via Learning Candidates Adjusters

Wanru Xu, Zhenjiang Miao, Jian Yu, Yigang Cen, Yi Tian, Lili Wan, Yanli Wan, Qiang Ji
DOI: https://doi.org/10.1145/3652838
2024-07-11
Abstract: Video captioning is a multimodal task spanning computer vision (CV) and natural language processing (NLP), whose goal is to automatically describe video content in natural language. Although large amounts of video data are available, annotations in the form of description sentences are very limited. In this paper, we define the semi-supervised video captioning (SSVC) problem, which aims to improve performance under limited annotations by leveraging semantic knowledge from both well-annotated and unannotated samples. To address this problem, we introduce an LCA-boosted model (LCABM) for boosting SSVC, which first explores a learnable candidates adjuster to adjust the caption candidates and then treats the adjusted captions as pseudo labels to train the SSVC model on unannotated samples in turn. In particular, model learning is formulated as a bi-level optimization problem and solved by an EM-like multi-stage training algorithm. Experiments show the effectiveness of the proposed LCABM, whose performance is comparable to, and in some cases better than, that of state-of-the-art fully supervised methods, even with fewer annotations.
computer science, information systems, theory & methods, software engineering
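As a reading aid for the bi-level view mentioned in the abstract, the following is one generic way such a pseudo-label objective is often written. It is a sketch under assumed notation, not the formulation from the paper: \theta (captioner parameters), \phi (adjuster parameters), \mathcal{D}_l and \mathcal{D}_u (annotated and unannotated videos), C(v) (caption candidates for video v), A_\phi (the adjuster), \ell (a caption loss), and \lambda (a weighting factor) are all hypothetical symbols introduced here for illustration.

```latex
\begin{aligned}
\min_{\phi}\quad
  & \mathcal{L}_{\mathrm{sup}}\bigl(\theta^{*}(\phi);\ \mathcal{D}_{l}\bigr) \\
\text{s.t.}\quad
  & \theta^{*}(\phi) \in \arg\min_{\theta}\
    \mathcal{L}_{\mathrm{sup}}\bigl(\theta;\ \mathcal{D}_{l}\bigr)
    + \lambda \sum_{v \in \mathcal{D}_{u}}
      \ell\bigl(f_{\theta}(v),\ A_{\phi}(C(v))\bigr)
\end{aligned}
```

Under this assumed form, an EM-like multi-stage scheme would alternate between two steps: with \phi fixed, update \theta on annotated data plus the pseudo labels A_\phi(C(v)); with \theta fixed, update \phi so that the resulting captioner does better on annotated data. The actual stages, losses, and candidate generation used by LCABM may differ.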