Abstract: Weakly-supervised grounded image captioning (WSGIC) aims to generate a caption and ground (localize) the predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) encode the input image into multiple region features using an object detector; (2) leverage the region features for captioning and grounding. However, relying on independent proposals produced by object detectors tends to make the subsequent grounded captioner overfit to finding the correct object words, overlook the relations between objects, and select incompatible proposal regions for grounding. To address these issues, we propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level. Specifically, we encode the image into visual token representations and propose a Recurrent Grounding Module (RGM) in the decoder to obtain precise Visual Language Attention Maps (VLAMs), which recognize the spatial locations of the objects. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. These relation semantics serve as contextual information that facilitates the prediction of relation and object words in the caption. We observe that the relation semantics not only help the grounded captioner generate more accurate captions but also improve the grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance. We will make the code publicly available.

Cycle-Consistent Weakly Supervised Visual Grounding With Individual and Contextual Representations

Visual-Semantic Graph Matching for Visual Grounding

Cycle-Consistency Learning for Captioning and Grounding

Top-down framework for weakly-supervised grounded image captioning

Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

Weakly Supervised Attention Learning for Textual Phrases Grounding

vtGraphNet: Learning weakly-supervised scene graph for complex visual grounding

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Weakly-Supervised Video Object Grounding via Causal Intervention

Weakly-Supervised Spoken Video Grounding Via Semantic Interaction Learning

Language Adaptive Weight Generation for Multi-task Visual Grounding

Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement

Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment

Visual Grounding With Joint Multimodal Representation and Interaction

Joint Visual Grounding with Language Scene Graphs

Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation

Cross-Modal Match for Language Conditioned 3D Object Grounding

A Dual Reinforcement Learning Framework for Weakly Supervised Phrase Grounding

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning