Abstract: Video captioning generates a sentence that describes the video content. Existing methods typically require multiple captions (e.g., 10 or 20) per video to train the model, which is costly. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike the random sampling used in natural language processing, which may cause invalid modifications (i.e., word edits), the former module guides the model to edit words using actions (e.g., copy, replace, insert, and delete) predicted by a pretrained token-level classifier, and then fine-tunes candidate sentences with a pretrained language model. Meanwhile, the former module employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep semantic consistency between pseudo-labeled sentences and video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy that places more emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios. The code implementation is available at https://github.com/mlvccn/PKG_VidCap
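To make the token-level edit actions in the abstract concrete, here is a minimal sketch of applying one action per word to a caption. The action names come from the abstract; the `apply_edits` helper, the `(action, word)` pair format, the rule that an inserted word goes before the current token, and the hand-written actions are all assumptions for illustration. In the paper the actions are predicted by a pretrained classifier and the new words come from a language model, neither of which is modeled here.

```python
from typing import List, Optional, Tuple

# Action names taken from the abstract; the constants themselves are hypothetical.
COPY, REPLACE, INSERT, DELETE = "copy", "replace", "insert", "delete"

# One edit per source token: (action, new word or None).
Edit = Tuple[str, Optional[str]]


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply one edit action to each source token and return the edited sentence."""
    if len(tokens) != len(edits):
        raise ValueError("expected exactly one edit per token")
    out: List[str] = []
    for token, (action, word) in zip(tokens, edits):
        if action in (REPLACE, INSERT) and word is None:
            raise ValueError(f"action {action!r} needs a new word")
        if action == COPY:
            out.append(token)            # keep the token unchanged
        elif action == REPLACE:
            out.append(word)             # swap the token for a new word
        elif action == INSERT:
            out.extend([word, token])    # assumption: insert before the token
        elif action == DELETE:
            continue                     # drop the token
        else:
            raise ValueError(f"unknown action {action!r}")
    return out


caption = "a man is playing a guitar outside".split()
edits = [
    (COPY, None), (REPLACE, "musician"), (COPY, None), (COPY, None),
    (COPY, None), (INSERT, "red"), (DELETE, None),
]
pseudo_label = " ".join(apply_edits(caption, edits))
print(pseudo_label)
```

Constraining edits to a small action set like this keeps each pseudo-label close to the ground-truth sentence, which is the motivation the abstract gives for avoiding unconstrained random sampling.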

What problem does this paper attempt to address?

The paper addresses how to train a video captioning model that generates high-quality descriptions from very little supervision (for example, only one or two annotated sentences per video). Existing methods usually require many annotations (such as 10 to 20 annotated sentences per video) to train the model, which incurs high labor costs. The paper therefore proposes a new task, few-supervised video captioning, which aims to train the model with very few annotations (such as a single annotated sentence per video) while achieving performance as close as possible to that of fully-supervised methods.

### Main Problems and Challenges

1. **How to Expand Sentences to Generate Pseudo-Labels**: With so little annotated data, a key issue is how to generate more training samples through effective data augmentation.
2. **How to Ensure the Quality of Generated Pseudo-Label Sentences**: The generated pseudo-label sentences must be consistent with the video content and semantically sound for model training to be effective.

### Overview of Solutions

To solve these problems, the authors propose Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning (PKG), which includes two main modules:

1. **Lexically Constrained Pseudo-Labeling Module**:
   - Edit the original sentences with pre-trained language models and an action classifier (predicting operations such as copy, replace, insert, and delete) to generate diverse pseudo-label sentences.
   - Use a repetition-penalized sampling strategy to reduce repetition in the generated sentences.
   - Use a pre-trained video-text matching model (such as X-CLIP) to select the pseudo-label sentences most relevant to the video content.
2. **Keyword-Refined Captioning Module**:
   - Design a Transformer-based keyword refining mechanism that adjusts the weights of keywords in the pseudo-label sentences through attention, so that they align better with the video content.
   - Introduce a semantic loss between keywords to ensure semantic consistency between the pseudo-label sentences and the manually annotated sentences.

### Experimental Results

Experiments on multiple public datasets (such as MSVD, MSR-VTT, and VATEX) show that the method performs strongly in the few-supervised scenario and in some cases even exceeds the performance of fully-supervised methods. The method also outperforms existing state-of-the-art supervised methods when all annotated data is used.

### Formula Presentation

The following formulas appear in the paper:

- **Calculate the Probabilities of Forward Language Model and Backward Language Model**:

  \[ \hat{p}_{FLM} = \frac{p_{FLM}(\hat{Y}_{1 \rightarrow (t - 1)} \mid \hat{y}_t = y_{voc}^j)}{\sum_{j = 1}^{N_{voc}} p_{FLM}(\hat{Y}_{1 \rightarrow (t - 1)} \mid \hat{y}_t = y_{voc}^j) \cdot I(y_{voc}^j \in \hat{Y})} \]

  \[ \hat{p}_{BLM} = \frac{p_{BLM}(\hat{Y}_{lens \rightarrow (t + 1)} \mid \hat{y}_t = y_{voc}^j)}{\sum_{j = 1}^{N_{voc}} p_{BLM}(\hat{Y}_{lens \rightarrow (t + 1)} \mid \hat{y}_t = y_{voc}^j) \cdot I(y_{voc}^j \in \hat{Y})} \]

- **Final Replacement Probability**:
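The repetition-penalized sampling mentioned in the module overview can be illustrated separately from the formulas above. The sketch below uses the widely used CTRL-style repetition penalty, which lowers the score of any word already present in the sentence before applying softmax. This is a stand-in and not necessarily the paper's exact formulation; the function names, the `penalty` value, and the toy logits are assumptions.

```python
import math
import random
from typing import Dict, Sequence


def penalized_distribution(
    logits: Dict[str, float],
    generated: Sequence[str],
    penalty: float = 1.2,  # hypothetical value; 1.0 disables the penalty
) -> Dict[str, float]:
    """Softmax over logits after down-weighting words that were already generated."""
    if penalty < 1.0:
        raise ValueError("penalty must be >= 1.0")
    seen = set(generated)
    adjusted: Dict[str, float] = {}
    for word, logit in logits.items():
        if word in seen:
            # Divide positive logits and multiply negative ones,
            # so a repeated word's score never goes up.
            logit = logit / penalty if logit > 0 else logit * penalty
        adjusted[word] = logit
    peak = max(adjusted.values())  # subtract the max for numerical stability
    exps = {w: math.exp(l - peak) for w, l in adjusted.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def sample_word(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one word according to the penalized distribution."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


# Toy vocabulary scores: "guitar" already appears in the partial sentence.
toy_logits = {"guitar": 2.0, "piano": 2.0, "the": -1.0}
probs = penalized_distribution(toy_logits, generated=["a", "man", "plays", "guitar"])
next_word = sample_word(probs, random.Random(0))
```

With equal raw scores, the already-used word "guitar" ends up less likely than "piano", which is the behavior a repetition penalty is meant to produce.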

Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning
