Improving Generalization of Image Captioning with Unsupervised Prompt Learning

Hongchen Wei, Zhenzhong Chen
2023-08-05
Abstract: Pretrained visual-language models have demonstrated impressive zero-shot abilities in image captioning when accompanied by hand-crafted prompts. Hand-crafted prompts use human prior knowledge to guide the model. However, because domains differ widely, hand-crafted prompts that provide invariant prior knowledge may result in mode collapse for some domains. Some studies have attempted to incorporate expert knowledge and instruction datasets, but these approaches were costly and led to hallucinations. In this paper, we propose an unsupervised prompt learning method to improve Generalization of Image Captioning (GeneIC), which learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model, thus optimizing the domain-specific prompt vector from two aspects: attribute and semantic consistency. Specifically, GeneIC first generates attribute-transferred images with differing attributes, while retaining semantic similarity with the original images. Then, GeneIC uses CLIP to measure the similarity between the images and the generated sentences. By exploring the variable and invariant features in the original images and attribute-transferred images, attribute consistency constrains the attribute change direction of both images and sentences to learn domain-specific knowledge. Semantic consistency directly measures the similarity between the generated sentences and images to ensure the accuracy and comprehensiveness of the generated sentences. Consequently, GeneIC optimizes only the prompt vectors, which effectively retains the knowledge in the large model and introduces domain-specific knowledge.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the **generalization problem** in image captioning, particularly when a model is applied across different target domains. Specifically:

1. **Limitations of Hand-Crafted Prompts**: Pretrained visual-language models rely on hand-crafted prompts (e.g., "A photo of...") for image captioning. While this approach performs well in the source domain, it tends to lead to mode collapse in the target domain (i.e., the generated descriptions lack diversity and are of low quality). This is because hand-crafted prompts provide fixed prior knowledge and cannot adapt to domain-specific knowledge.
2. **Automatic Prompt Learning**: To overcome this limitation, the paper proposes an unsupervised prompt learning method (GeneIC) that learns domain-specific prompt vectors without labeled data. The method optimizes the prompt vectors by exploring the variable and invariant features within the target domain, thereby guiding the model to generate high-quality descriptions that include domain-specific knowledge.
3. **Attribute Consistency and Semantic Consistency**: To achieve this, GeneIC uses a pre-trained CLIP model to align the visual and language modalities and optimizes the prompt vectors from two aspects: attribute consistency and semantic consistency. Attribute consistency ensures that the direction of attribute change is the same for images and for sentences, while semantic consistency ensures that the generated sentences match the content of the input images.

Through these methods, the paper addresses the insufficient generalization of existing methods across domains and reports better generalization performance than state-of-the-art techniques on multiple target-domain datasets.
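
The two consistency objectives can be illustrated with a small sketch. The summary does not give the paper's loss formulas, so what follows is only one plausible formulation under stated assumptions: embeddings are taken to be plain vectors already produced by CLIP's image and text encoders, both objectives are expressed as one minus cosine similarity, and the function names `semantic_consistency_loss` and `attribute_consistency_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for degenerate (all-zero) vectors.
    return dot / max(norm_u * norm_v, 1e-12)


def semantic_consistency_loss(image_emb, sentence_emb):
    """Assumed form: low when the generated sentence matches its image."""
    return 1.0 - cosine(image_emb, sentence_emb)


def attribute_consistency_loss(img_orig, img_transferred,
                               sent_orig, sent_transferred):
    """Assumed form: low when the change between the original and the
    attribute-transferred image points in the same direction as the
    change between their two generated sentences."""
    image_shift = [b - a for a, b in zip(img_orig, img_transferred)]
    sentence_shift = [b - a for a, b in zip(sent_orig, sent_transferred)]
    return 1.0 - cosine(image_shift, sentence_shift)


# Toy 2-D "embeddings" (made up for illustration, not CLIP outputs).
img_orig, img_transferred = [1.0, 0.0], [1.0, 1.0]
sent_orig, sent_transferred = [0.5, 0.0], [0.5, 2.0]

semantic = semantic_consistency_loss(img_orig, sent_orig)
attribute = attribute_consistency_loss(
    img_orig, img_transferred, sent_orig, sent_transferred
)
```

In this toy case both losses are close to zero, because the sentence embedding is parallel to the image embedding and both modalities shift along the same axis. In a setup like the one described, such losses would be minimized with respect to the prompt vectors only, with the captioning model and CLIP kept frozen.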