Abstract: Attention mechanisms have been extensively adopted in vision and language tasks such as image captioning. They encourage a captioning model to dynamically ground appropriate image regions when generating words or phrases, which is critical for alleviating object hallucination and language bias. However, current studies show that the grounding accuracy of existing captioners is still far from satisfactory. Recently, much effort has been devoted to improving grounding accuracy by linking words to the full content of objects in images. However, due to noisy grounding annotations and large variations in object appearance, such strict word-object alignment regularization may not be optimal for improving captioning performance. In this paper, to improve the performance of both grounding and captioning, we propose a novel grounding model that implicitly links words to the evidence in the image. The proposed model encourages the captioner to dynamically focus on informative regions of the objects, which can be either discriminative parts or the full object content. With relaxed constraints, the proposed captioning model can capture correct linguistic characteristics and visual relevance, and thus generate more grounded image captions. In addition, we propose a novel quantitative metric for evaluating the correctness of the soft attention mechanism by considering the overall contribution of all object proposals when generating certain words. The proposed grounding model can be seamlessly plugged into most attention-based architectures without introducing additional inference complexity. We conduct extensive experiments on the Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) datasets, demonstrating that the proposed method consistently improves image captioning in both grounding and captioning. Moreover, the proposed attention evaluation metric shows better consistency with captioning performance.

Learning Comprehensive Visual Grounding for Video Captioning

Comprehensive Visual Grounding for Video Description

Visual-Semantic Graph Matching for Visual Grounding

Exploiting Auxiliary Caption for Video Grounding

Visual Cluster Grounding for Image Captioning

Grounded Video Description

Grounded Video Caption Generation

DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

Video-Guided Curriculum Learning for Spoken Video Grounding

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Grounded Video Situation Recognition

Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding

Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

What, when, and where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

SnAG: Scalable and Accurate Video Grounding

Top-down framework for weakly-supervised grounded image captioning

Cycle-Consistency Learning for Captioning and Grounding

Generalizable Entity Grounding via Assistance of Large Language Model

End-to-end Multi-modal Video Temporal Grounding