Abstract: With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be cast as sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and neglect the visual policy, and thus fail to capture visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared with traditional visual attention mechanisms that fix on only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, comprising CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP with state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.

Context-Aware Image Descriptions for Web Accessibility

Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs

Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering

"It's Kind of Context Dependent": Understanding Blind and Low Vision People's Video Accessibility Preferences Across Viewing Scenarios

Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People

GenAssist: Making Image Generation Accessible

WorldScribe: Towards Context-Aware Live Visual Descriptions

"Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning

Description Enhancement of Generated Images via Automatic Visual Question Generation and Answering

Latent Visual Context Learning for Web Image Applications

ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions

Investigating Use Cases of AI-Powered Scene Description Applications for Blind and Low Vision People

A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction

Toward accessible comics for blind and low vision readers

Context-aware captions from context-agnostic supervision

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning

Context-Aware Visual Policy Network for Fine-Grained Image Captioning

A New Approach to Image Enhancement for the Visually Impaired

Exploiting Text-Image Latent Spaces for the Description of Visual Concepts

How Culturally Aware are Vision-Language Models?