Abstract: With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be cast as sequential language prediction given visual content, where the output sequence forms a natural-language description with plausible grammar. However, existing image captioning methods focus only on the language policy and neglect the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation, covering both image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared with traditional visual attention mechanisms, which fix on only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model (CAVP and its subsequent language policy network) can be efficiently optimized end-to-end using an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.
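The idea of treating previous visual attentions as context can be illustrated with a small sketch. This is not the CAVP architecture: the helper names (`attend`, `context_aware_step`), the mean-pooled history, and the sigmoid gate are all assumptions chosen for illustration, and the learned networks and actor-critic training are omitted entirely.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(regions, query):
    # Conventional single-step soft attention: a weighted average of
    # region features, scored by dot product with the query.
    weights = softmax([dot(r, query) for r in regions])
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]

def context_aware_step(regions, query, history):
    # Hypothetical context-aware step: summarize past attended features
    # (mean pooling, an assumption) and let a scalar gate decide how much
    # of that context is mixed into the current attention.
    current = attend(regions, query)
    if not history:
        return current, 0.0  # no context yet, so the gate is closed
    dim = len(current)
    context = [sum(h[i] for h in history) / len(history) for i in range(dim)]
    gate = 1.0 / (1.0 + math.exp(-dot(current, context)))
    fused = [(1.0 - gate) * c + gate * x for c, x in zip(current, context)]
    return fused, gate

# Toy example: three image regions with 2-d features, two decoding steps.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]
history = []
for q in queries:
    fused, gate = context_aware_step(regions, q, history)
    history.append(fused)
```

In this toy version the gate is a fixed function of the features; in a learned model the decision to use context would be produced by trained parameters and optimized jointly with the language policy.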
