Abstract: Image captioning has been an emerging and fast-developing research topic. Nevertheless, most existing works rely heavily on large amounts of image-sentence pairs, which hinders the practical application of captioning in the wild. In this paper, we present a novel Semantic-Constrained Self-learning (SCS) framework that explores an iterative self-learning strategy to learn an image captioner with only unpaired image and text data. Technically, SCS consists of two stages, i.e., pseudo pair generation and captioner re-training, which iteratively produce "pseudo" image-sentence pairs via a pre-trained captioner and re-train the captioner with the pseudo pairs, respectively. In particular, both stages are guided by the objects recognized in the image, which act as a semantic constraint to strengthen the semantic alignment between the input image and the output sentence. For pseudo pair generation, we leverage a semantic-constrained beam search that regularizes the decoding process by forcing the inclusion of the recognized objects in the output sentence and the exclusion of irrelevant ones. For captioner re-training, a self-supervised triplet loss is utilized to preserve the relative semantic similarity ordering among generated sentences with regard to the input image triplets. Moreover, during self-critical training, an object inclusion reward encourages the inclusion of the predicted objects in the output sentence, and an adversarial reward pursues the generation of more realistic sentences. Experiments conducted on both dependent and independent unpaired data validate the superiority of SCS. More remarkably, we obtain the best published CIDEr score to date of 74.7% on the COCO Karpathy test split for unpaired image captioning.
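The abstract does not give formulas for the object-guided components, so the following is only a minimal sketch of how they might look, not the authors' implementation. All function names, the whitespace tokenizer, the exact-match object test, the fractional form of the reward, and the hinge form and margin of the triplet loss are assumptions made for illustration.

```python
from typing import Iterable, Set


def tokenize(sentence: str) -> Set[str]:
    """Lowercased whitespace tokens with surrounding punctuation stripped (a simplification)."""
    return {tok.strip(".,!?;:").lower() for tok in sentence.split()} - {""}


def satisfies_constraints(sentence: str,
                          include: Iterable[str],
                          exclude: Iterable[str]) -> bool:
    """Hypothetical check: every recognized object appears, no irrelevant object does.

    A constrained beam search would enforce this during decoding; here it is
    applied to a finished sentence only.
    """
    words = tokenize(sentence)
    return (all(obj.lower() in words for obj in include)
            and not any(obj.lower() in words for obj in exclude))


def object_inclusion_reward(sentence: str, recognized: Iterable[str]) -> float:
    """Assumed reward: fraction of recognized objects mentioned in the sentence."""
    objects = {obj.lower() for obj in recognized}
    if not objects:
        return 0.0
    return len(objects & tokenize(sentence)) / len(objects)


def triplet_hinge_loss(sim_pos: float, sim_neg: float, margin: float = 0.2) -> float:
    """Generic hinge-style triplet loss (assumed form).

    sim_pos: similarity between sentences of an anchor image and a semantically closer image.
    sim_neg: similarity between sentences of the anchor image and a less similar image.
    The loss is zero once sim_pos exceeds sim_neg by at least the margin.
    """
    return max(0.0, margin + sim_neg - sim_pos)


if __name__ == "__main__":
    caption = "A dog chases a ball."
    print(satisfies_constraints(caption, ["dog", "ball"], ["cat"]))
    print(object_inclusion_reward(caption, ["dog", "ball", "frisbee", "park"]))
    print(triplet_hinge_loss(sim_pos=0.9, sim_neg=0.1))
```

In this sketch the constraint check acts as a filter on candidate pseudo captions, while the reward and the loss would be terms in the re-training objective; how SCS actually weights and combines them is not stated in the abstract.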

Boosting Semi-Supervised Video Captioning via Learning Candidates Adjusters

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Semi-Supervised Learning for Video Captioning.

Adaptive Curriculum Learning for Video Captioning.

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

Structural Semantic Adversarial Active Learning for Image Captioning

Semantic-Driven Saliency-Context Separation for Video Captioning

Weakly Supervised Dense Video Captioning

A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Video Captioning With Attention-Based LSTM and Semantic Consistency

Learning Video-Text Aligned Representations for Video Captioning

Measuring apoptosis in neural stem cells.

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Local-to-Global Semantic Supervised Learning for Image Captioning

Unpaired Image Captioning With Semantic-Constrained Self-Learning

Discriminative Latent Semantic Graph for Video Captioning

Learning Multimodal Attention LSTM Networks for Video Captioning.

Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

Multimodality-guided Visual-Caption Semantic Enhancement

Boosting convolutional image captioning with semantic content and visual relationship

Multi-level video captioning method based on semantic space