Abstract: Artificial Intelligence Generated Content (AIGC) has shown promising prospects in both the computer vision and natural language processing communities, and image captioning, an essential aspect of AIGC, has received growing attention. Recent vision-language research is moving away from bulky region-based visual representations, which rely on object detectors, toward more convenient and flexible grid-based ones. However, this line of research typically concentrates on image understanding tasks such as image classification, with less attention paid to content generation tasks. In this paper, we explore how to capitalize on the expressive features embedded in grid visual representations for better image captioning. To this end, we present a Transformer-based image captioning model, dubbed FeiM, with two straightforward yet effective designs. First, we design feature queries, a limited set of learnable vectors that act as local signals to capture specific visual information from the global grid features. Second, taking the augmented global grid features and the local feature queries as inputs, we develop a feature interaction module that queries relevant visual concepts from the grid features and enables interaction between the local signals and the overall context. Finally, the refined grid visual representations and the linguistic features pass through a Transformer architecture for multi-modal fusion. With these two simple designs, FeiM can fully leverage meaningful visual knowledge to improve image captioning performance. Extensive experiments on the competitive MSCOCO benchmark confirm the effectiveness of the proposed approach, and the results show that FeiM outperforms existing advanced captioning models.
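The interplay between feature queries and grid features described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the module's internals, so the sketch assumes plain single-head scaled dot-product attention without learned projections, residual updates in both directions, and randomly initialized vectors standing in for the learnable queries. The names `attend` and `feature_interaction` and all sizes are hypothetical, and only the Python standard library is used to keep the example dependency-free.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Returns (outputs, weights): one output vector and one weight row per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    weights = [
        softmax([scale * sum(q * k for q, k in zip(query, key))
                 for key in keys_values])
        for query in queries
    ]
    outputs = [
        [sum(w * value[j] for w, value in zip(row, keys_values))
         for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


def feature_interaction(feature_queries, grid_features):
    """Hypothetical two-step interaction (an assumption, not the paper's design)."""
    # Step 1: each feature query gathers relevant visual content from the grid.
    gathered, query_weights = attend(feature_queries, grid_features)
    refined_queries = [
        [q + g for q, g in zip(q_row, g_row)]
        for q_row, g_row in zip(feature_queries, gathered)
    ]
    # Step 2: each grid cell reads back from the refined queries (residual update).
    read_back, _ = attend(grid_features, refined_queries)
    refined_grid = [
        [x + r for x, r in zip(x_row, r_row)]
        for x_row, r_row in zip(grid_features, read_back)
    ]
    return refined_grid, refined_queries, query_weights


# Illustrative sizes only (a 7x7 grid and 4 queries); none come from the paper.
DIM, NUM_CELLS, NUM_QUERIES = 8, 49, 4
rng = random.Random(0)
grid = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_CELLS)]
# Random stand-in for the learnable feature queries.
queries = [[rng.gauss(0, 0.02) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

refined_grid, refined_queries, query_weights = feature_interaction(queries, grid)
```

In this sketch the refined grid keeps the shape of the input grid, so it could be passed to a standard Transformer encoder-decoder in place of the raw grid features. A trained model would replace the random queries with learned parameters and would typically add projection matrices, multiple heads, and normalization.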


Exploring better image captioning with grid features
