Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning

Ping Li, Tao Wang, Xinkui Zhao, Xianghua Xu, Mingli Song
2024-11-07
Abstract: Video captioning generates a sentence that describes the video content. Existing methods always require a number of captions (e.g., 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike the random sampling in natural language processing that may cause invalid modifications (i.e., word edits), the former module guides the model to edit words using actions (e.g., copy, replace, insert, and delete) predicted by a pretrained token-level classifier, and then fine-tunes candidate sentences with a pretrained language model. Meanwhile, the former module employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep semantic consistency between pseudo-labeled sentences and video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy that places more emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios. The code implementation is available at https://github.com/mlvccn/PKG_VidCap
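The repetition-penalized sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes the common logit-rescaling form of the penalty, in which the scores of already-generated tokens are lowered before the softmax. The function names, the toy vocabulary, and the `penalty` and `temperature` values are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def penalized_probs(logits, generated_ids, penalty=1.2, temperature=1.0):
    """Softmax over logits after lowering the scores of already-generated tokens.

    Assumption: logit-rescaling penalty, not necessarily the paper's formula.
    """
    adjusted = list(logits)
    for i in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        if adjusted[i] > 0:
            adjusted[i] = adjusted[i] / penalty
        else:
            adjusted[i] = adjusted[i] * penalty
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, generated_ids, penalty=1.2, temperature=1.0, rng=random):
    """Draw one token index from the penalized distribution."""
    probs = penalized_probs(logits, generated_ids, penalty, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy example (hypothetical vocabulary and scores).
vocab = ["a", "man", "is", "playing", "guitar"]
logits = [2.0, 1.5, 0.5, -0.3, 1.0]
generated = [0, 1]  # "a man" has already been emitted

plain = penalized_probs(logits, [], penalty=1.2)
penalized = penalized_probs(logits, generated, penalty=1.2)
next_id = sample_next(logits, generated, penalty=1.2, rng=random.Random(0))
```

With `penalty` greater than 1, the probabilities of "a" and "man" fall relative to the unpenalized distribution and the remaining mass shifts to the unused words, which is the effect the abstract describes as yielding sentences with less repetition.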
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to train a video captioning model that generates high-quality descriptions from only a small amount of supervision (for example, only one or two annotated sentences per video). Existing methods usually require a large amount of annotated data (such as 10 to 20 annotated sentences per video) to train the model, which results in high labor costs. The paper therefore proposes a new task, few-supervised video captioning, which aims to train the model with very little annotated data (such as only one annotated sentence per video) while achieving performance as close as possible to that of fully-supervised methods.

### Main Problems and Challenges

1. **How to Expand Sentences to Generate Pseudo-Labels**: Since only a small amount of annotated data is available, a key issue is how to generate more training samples through effective data augmentation.
2. **How to Ensure the Quality of Generated Pseudo-Label Sentences**: The generated pseudo-label sentences must be consistent with the video content and have high semantic quality for model training to be effective.

### Overview of Solutions

To solve the above problems, the authors propose a method named Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning (PKG), which includes two main modules:

1. **Lexically Constrained Pseudo-Labeling Module**:
   - Edit the original sentences using pre-trained language models and a token-level action classifier that predicts edit operations (such as copy, replace, insert, and delete) to generate diverse pseudo-label sentences.
   - Use the repetition-penalized sampling strategy to reduce repetition in the generated sentences.
   - Use a pre-trained video-text matching model (such as X-CLIP) to select the pseudo-label sentences most relevant to the video content.
2. **Keyword-Refined Captioning Module**:
   - Design a Transformer-based keyword refining mechanism that adjusts the weights of keywords in the pseudo-label sentences through the attention mechanism so that they align better with the video content.
   - Introduce a semantic loss function between keywords to ensure semantic consistency between the pseudo-label sentences and the manually annotated sentences.

### Experimental Results

Experiments on multiple public datasets (such as MSVD, MSR-VTT, and VATEX) show that this method performs well in the few-supervised scenario and in some cases even exceeds the performance of fully-supervised methods. In addition, the method outperforms existing state-of-the-art supervised methods when all annotated data is used.

### Formula Presentation

The following formulas appear in the paper:

- **Calculate the Probabilities of Forward Language Model and Backward Language Model**:

  \[ \hat{p}_{FLM} = \frac{p_{FLM}(\hat{Y}_{1 \rightarrow (t-1)} \mid \hat{y}_t = y_{voc}^j)}{\sum_{j=1}^{N_{voc}} p_{FLM}(\hat{Y}_{1 \rightarrow (t-1)} \mid \hat{y}_t = y_{voc}^j) \cdot I(y_{voc}^j \in \hat{Y})} \]

  \[ \hat{p}_{BLM} = \frac{p_{BLM}(\hat{Y}_{lens \rightarrow (t+1)} \mid \hat{y}_t = y_{voc}^j)}{\sum_{j=1}^{N_{voc}} p_{BLM}(\hat{Y}_{lens \rightarrow (t+1)} \mid \hat{y}_t = y_{voc}^j) \cdot I(y_{voc}^j \in \hat{Y})} \]

- **Final Replacement Probability**: