Improving Generalization of Image Captioning with Unsupervised Prompt Learning

Hongchen Wei, Zhenzhong Chen

2023-08-05

Abstract: Pretrained visual-language models have demonstrated impressive zero-shot abilities in image captioning when accompanied by hand-crafted prompts, which use human prior knowledge to guide the model. However, because domains differ widely, hand-crafted prompts that provide invariant prior knowledge may result in mode collapse for some domains. Some studies have attempted to incorporate expert knowledge and instruction datasets, but doing so was costly and led to hallucinations. In this paper, we propose an unsupervised prompt learning method to improve Generalization of Image Captioning (GeneIC), which learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model, thus optimizing the domain-specific prompt vector from two aspects: attribute consistency and semantic consistency. Specifically, GeneIC first generates attribute-transferred images that differ in attributes while retaining semantic similarity with the original images. Then, GeneIC uses CLIP to measure the similarity between the images and the generated sentences. By exploring the variable and invariant features in the original images and attribute-transferred images, attribute consistency constrains the attribute change direction of both images and sentences to learn domain-specific knowledge. Semantic consistency directly measures the similarity between the generated sentences and images to ensure the accuracy and comprehensiveness of the generated sentences. Consequently, GeneIC optimizes only the prompt vectors, which effectively retains the knowledge in the large model while introducing domain-specific knowledge.
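The idea of optimizing only the prompt vector while the large model stays frozen can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `FROZEN_W` is a hypothetical stand-in for the frozen captioner and CLIP text encoder, the loss is a plain `1 - cosine` similarity, and the gradient is estimated by finite differences instead of backpropagation. It is not the paper's implementation.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

# Hypothetical frozen weights: a toy linear map from a 2-d prompt vector
# to a 3-d "text embedding". It is never updated during tuning.
FROZEN_W = ((1.0, 0.5), (0.0, 1.0), (0.5, 0.0))

def text_embedding(prompt):
    """Embed the 'caption' produced under a given prompt (toy linear model)."""
    return [sum(w * p for w, p in zip(row, prompt)) for row in FROZEN_W]

def loss(prompt, image_embedding):
    """Dissimilarity between the caption embedding and the image embedding."""
    return 1.0 - cosine(text_embedding(prompt), image_embedding)

def tune_prompt(prompt, image_embedding, lr=0.5, steps=200, h=1e-5):
    """Gradient descent on the prompt vector only, using central differences."""
    prompt = list(prompt)
    for _ in range(steps):
        grad = []
        for i in range(len(prompt)):
            up, down = prompt[:], prompt[:]
            up[i] += h
            down[i] -= h
            diff = loss(up, image_embedding) - loss(down, image_embedding)
            grad.append(diff / (2 * h))
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

if __name__ == "__main__":
    target = text_embedding([0.2, 1.0])   # toy "image embedding"
    start = [1.0, 0.0]
    tuned = tune_prompt(start, target)
    print("loss before:", loss(start, target))
    print("loss after: ", loss(tuned, target))
```

In this toy setting the only trainable quantity is the prompt, so whatever the frozen map encodes is left untouched, which mirrors the motivation stated in the abstract.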

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the **generalization problem** in image captioning, particularly in applications across different target domains. Specifically:

1. **Limitations of Hand-Crafted Prompts**: Traditional visual-language models rely on hand-crafted prompts (e.g., "A photo of...") for image captioning. While this approach performs well in the source domain, it tends to lead to mode collapse in the target domain (i.e., the generated descriptions lack diversity and are of low quality). This is because hand-crafted prompts cannot adaptively capture domain-specific knowledge.
2. **Automatic Prompt Learning**: To overcome this limitation, the paper proposes an unsupervised prompt learning method (GeneIC) that can learn domain-specific prompt vectors without labeled data. This method optimizes the prompt vectors by exploring the variable and invariant features within the target domain, thereby guiding the model to generate high-quality descriptions that include domain-specific knowledge.
3. **Attribute Consistency and Semantic Consistency**: To achieve this, GeneIC leverages a pre-trained CLIP model to align visual and language modalities and optimizes the prompt vectors from two aspects: attribute consistency and semantic consistency. Attribute consistency ensures that the direction of attribute changes between images and sentences is consistent, while semantic consistency ensures that the generated sentences are semantically consistent with the input images.

Through these methods, the paper addresses the insufficient generalization capability of existing methods across different domains and demonstrates superior generalization performance on multiple target-domain datasets compared to state-of-the-art techniques.
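The two consistency terms can be sketched as simple functions over embedding vectors. The source gives no formulas, so the cosine-based forms below, the function names, the `weight` parameter, and the toy vectors are all assumptions; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v, eps=1e-8):
    """Cosine similarity; returns 0 when either vector is all zeros."""
    return _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)) + eps)

def _sub(u, v):
    return [a - b for a, b in zip(u, v)]

def attribute_consistency_loss(img, img_t, txt, txt_t):
    """Assumed form: the change from the original to the attribute-transferred
    image should point the same way as the change between their captions."""
    return 1.0 - cosine(_sub(img_t, img), _sub(txt_t, txt))

def semantic_consistency_loss(img, txt):
    """Assumed form: each caption should stay close to its own image."""
    return 1.0 - cosine(img, txt)

def total_loss(img, img_t, txt, txt_t, weight=1.0):
    """Attribute term plus a weighted average of the two semantic terms."""
    attr = attribute_consistency_loss(img, img_t, txt, txt_t)
    sem = 0.5 * (semantic_consistency_loss(img, txt)
                 + semantic_consistency_loss(img_t, txt_t))
    return attr + weight * sem

if __name__ == "__main__":
    # Toy embeddings: original image, attribute-transferred image, and captions.
    img, img_t = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
    aligned = total_loss(img, img_t, txt=img, txt_t=img_t)
    swapped = total_loss(img, img_t, txt=img_t, txt_t=img)
    print("captions follow the attribute change:", aligned)
    print("captions move the opposite way:      ", swapped)
```

When the captions follow the same attribute change as the images, both terms are near zero; when the captions move in the opposite direction, the attribute term approaches its maximum of 2, which is the behavior a direction-based constraint is meant to penalize.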

