Abstract: The task of image captioning aims to generate captions directly from images via an automatically learned cross-modal generator. To build a well-performing generator, existing approaches usually need a large number of described images (i.e., supervised image-sentence pairs), which requires substantial manual labeling effort. However, in real-world applications, a more general scenario is that only a limited number of described images and a large number of undescribed images are available. The resulting challenge is how to effectively incorporate the undescribed images into the learning of the cross-modal generator (i.e., semisupervised image captioning). To solve this problem, we propose a novel image captioning method that exploits cross-modal prediction and relation consistency (CPRC), which aims to use the raw image input to constrain the generated sentence in the semantic space. In detail, considering that the heterogeneous gap between modalities makes supervision difficult when the global embedding is used directly, CPRC instead transforms both the raw image and the corresponding generated sentence into a shared semantic space, and measures the generated sentence from two aspects: 1) prediction consistency: CPRC uses the prediction of the raw image as a soft label to distill useful supervision for the generated sentence, rather than employing traditional pseudo labeling; and 2) relation consistency: CPRC develops a novel relation consistency between augmented images and the corresponding generated sentences to retain the important relational knowledge. As a result, CPRC supervises the generated sentence from both the informativeness and representativeness perspectives, and can reasonably use the undescribed images to learn a more effective generator under the semisupervised scenario.
The experiments show that our method outperforms state-of-the-art comparison methods on the MS-COCO "Karpathy" offline test split under complex nonparallel scenarios; for example, CPRC achieves an improvement of at least 6% in the CIDEr-D score.
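
The two consistency terms described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss forms, so everything here is an assumption: prediction consistency is modeled as a KL divergence from the image's soft label to the sentence's prediction, and relation consistency as the mean squared difference between scale-normalized pairwise distance matrices. The function names and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_consistency(image_logits, sentence_logits, temperature=1.0):
    """Assumed form: KL(p_image || p_sentence), with the image prediction
    acting as a soft label for the generated sentence."""
    p = softmax(image_logits, temperature)
    q = softmax(sentence_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalized_distances(embeddings):
    """Pairwise Euclidean distances, divided by their mean off-diagonal value
    so that the relational structure is compared independently of scale."""
    n = len(embeddings)
    dist = [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j])))
         for j in range(n)]
        for i in range(n)
    ]
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag) if off_diag else 0.0
    if mean == 0.0:
        mean = 1.0  # all embeddings identical: keep the zero matrix as is
    return [[d / mean for d in row] for row in dist]


def relation_consistency(image_embeddings, sentence_embeddings):
    """Assumed form: mean squared difference between the two normalized
    distance matrices (augmented images vs. their generated sentences)."""
    di = normalized_distances(image_embeddings)
    ds = normalized_distances(sentence_embeddings)
    n = len(di)
    return sum((di[i][j] - ds[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Under these assumptions, both terms are zero when the sentence side agrees with the image side (identical class predictions, or the same relative geometry among embeddings) and grow as they diverge. An unsupervised objective would add the two terms with weights that this sketch does not attempt to specify.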

Pseudo Content Hallucination for Unpaired Image Captioning

Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition

Unpaired Image Captioning With Semantic-Constrained Self-Learning

Prompt-Based Learning for Unpaired Image Captioning

Mitigating Open-Vocabulary Caption Hallucinations

See or Guess: Counterfactually Regularized Image Captioning

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

ALOHa: A New Measure for Hallucination in Captioning Models

Improving Image Captioning with Better Use of Captions

Mining core information by evaluating semantic importance for unpaired image captioning

MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning

Image Captioning with Partially Rewarded Imitation Learning

Zero-Shot Image Caption Inference System Based on Pretrained Models

Towards Unique and Informative Captioning of Images

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning

IC3: Image Captioning by Committee Consensus

Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Unpaired Image Captioning by Language Pivoting

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites