Abstract: Artificial Intelligence Generated Content (AIGC) has shown promising prospects in both the computer vision and natural language processing communities. As an essential aspect of AIGC, image captioning has received growing attention. Recent vision-language research is moving from bulky region-based visual representations, which depend on object detectors, toward more convenient and flexible grid-based ones. However, this line of research typically concentrates on image understanding tasks such as image classification, with less attention paid to content generation tasks. In this paper, we explore how to capitalize on the expressive features embedded in grid visual representations for better image captioning. To this end, we present a Transformer-based image captioning model, dubbed FeiM, with two straightforward yet effective designs. First, we design feature queries, a limited set of learnable vectors that act as local signals to capture specific visual information from the global grid features. Second, taking the augmented global grid features and the local feature queries as inputs, we develop a feature interaction module that queries relevant visual concepts from the grid features and enables interaction between the local signals and the overall context. Finally, the refined grid visual representations and the linguistic features pass through a Transformer architecture for multi-modal fusion. With these two simple designs, FeiM can better leverage meaningful visual knowledge to improve image captioning performance. Extensive experiments on the competitive MSCOCO benchmark confirm the effectiveness of the proposed approach, and the results show that FeiM achieves better results than existing advanced captioning models.
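The idea of a small set of feature queries attending over grid features can be illustrated with a minimal sketch. This is not the FeiM implementation: it assumes a single-head cross-attention with a residual connection, random untrained weights, and hypothetical sizes (a 7x7 grid, 8 queries, 64-dimensional features) chosen only for illustration. The function name `query_grid_features` and all parameter names are likewise assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_grid_features(queries, grid, w_q, w_k, w_v):
    """Single-head cross-attention from feature queries to grid cells.

    queries: (num_queries, d) local signals (learnable in a real model)
    grid:    (num_cells, d)   flattened global grid features
    Returns the refined queries (num_queries, d) and the attention
    weights (num_queries, num_cells).
    """
    q = queries @ w_q
    k = grid @ w_k
    v = grid @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each query's distribution over cells
    refined = queries + weights @ v          # residual update (an assumption)
    return refined, weights


# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
d, num_cells, num_queries = 64, 49, 8
grid = rng.standard_normal((num_cells, d))
queries = 0.02 * rng.standard_normal((num_queries, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

refined, weights = query_grid_features(queries, grid, w_q, w_k, w_v)
print(refined.shape, weights.shape)
```

Each row of `weights` is a probability distribution over the grid cells, so every query gathers a weighted summary of the global grid. In a trained model the queries and projection matrices would be learned, and the abstract's interaction between local signals and overall context would involve more than this single attention step.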

Exploring the Impact of Vision Features in News Image Captioning

Transform and Tell: Entity-Aware News Image Captioning

Visuals to Text: A Comprehensive Review on Automatic Image Captioning

Image Captioning in news report scenario

Improving Image Captioning with Better Use of Captions

Generating news image captions with semantic discourse extraction and contrastive style-coherent learning

Exploring Explicit and Implicit Visual Relationships for Image Captioning

Image Captioning using Facial Expression and Attention

Visually-Aware Context Modeling for News Image Captioning

Aligning Where to See and What to Tell: Image Caption with Region-Based Attention and Scene Factorization

Predicting Winning Captions for Weekly New Yorker Comics

ICECAP: Information Concentrated Entity-aware Image Captioning

Exploring better image captioning with grid features

Towards Unique and Informative Captioning of Images

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Image-relevant Entities Knowledge aware News Image Captioning

Automatic Caption Generation for News Images

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Video Summarization: Towards Entity-Aware Captions

Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning