Abstract: With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be cast as the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and neglect the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared with the traditional visual attention mechanism, which fixes on only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, consisting of CAVP and its subsequent language policy network, can be efficiently optimized end-to-end by using an actor-critic policy gradient method. We have demonstrated the effectiveness of CAVP by state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.
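The idea of treating previous visual attentions as context, and deciding at each step whether to use that context, can be illustrated with a small NumPy sketch. This is not the CAVP architecture from the paper: the dot-product attention, the mean-pooled history, the scalar sigmoid gate, and all names (`attend`, `context_aware_step`, `w_gate`) are assumptions made purely for illustration, and the actor-critic training is omitted entirely.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(regions, query):
    """Single-step soft attention: weight regions by dot-product similarity."""
    weights = softmax(regions @ query)
    return weights @ regions  # attended feature, shape (d,)

def context_aware_step(regions, query, history, w_gate):
    """Hypothetical step: gate between the current attended feature and past ones."""
    current = attend(regions, query)
    if not history:
        # No previous attention yet, so there is no context to use.
        return current, 0.0
    context = np.mean(history, axis=0)  # assumed summary of earlier steps
    logit = w_gate @ np.concatenate([current, context])
    gate = 1.0 / (1.0 + np.exp(-logit))  # value in (0, 1): how much context to use
    return gate * context + (1.0 - gate) * current, float(gate)

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 4))   # 5 toy region features of dimension 4
w_gate = rng.normal(size=8)         # untrained, random gate parameters
history = []
for t in range(3):
    query = rng.normal(size=4)      # stands in for the language model state
    visual, gate = context_aware_step(regions, query, history, w_gate)
    history.append(visual)
    print(t, round(gate, 3), visual.shape)
```

With a plain single-step attention, only `attend` would be called at each step; the gate is what lets earlier regions (for example the "man" when describing "man riding horse") influence the current visual feature.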

PromptCap: Prompt-Guided Task-Aware Image Captioning

Learning Combinatorial Prompts for Universal Controllable Image Captioning

Prompt-Based Learning for Unpaired Image Captioning

Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

Text Data-Centric Image Captioning with Interactive Prompts

QAVidCap: Enhancing Video Captioning Through Question Answering Techniques

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Video Interactive Captioning with Human Prompts

ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

CapWAP: Captioning with a Purpose

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

Improving Generalization of Image Captioning with Unsupervised Prompt Learning

CapOnImage: Context-driven Dense-Captioning on Image

Image Captioning by Asking Questions

Zero-shot Visual Question Answering with Language Model Feedback

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning