Abstract: Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features using an attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes inter-modal interactions, while neglecting the self-attention that characterizes intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach on the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranked first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
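
To illustrate what "intra- and inter-modal interactions in a unified attention block" could look like, the following is a minimal sketch and not the authors' implementation. It assumes plain scaled dot-product attention with residual connections, and it leaves out the learned projections, multi-head splitting, layer normalization, feed-forward sublayers, and causal masking that a real Transformer decoder would use. The names `attention`, `add_residual`, and `attention_block`, as well as the toy feature values, are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim_out)])
    return out


def add_residual(a, b):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attention_block(words, regions):
    """Hypothetical block: self-attention, then co-attention (assumed order)."""
    # Intra-modal: each word attends to all words.
    h = add_residual(words, attention(words, words, words))
    # Inter-modal: each word attends to the image region features.
    return add_residual(h, attention(h, regions, regions))


# Toy inputs: three region features and two word features, all 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]
out = attention_block(words, regions)
```

Stacking several such blocks is one way to read the "in-depth modular composition" mentioned in the abstract. The depth and exact wiring used by the MT model are not specified in this text.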

DCMSTRD: End-to-end Dense Captioning via Multi-Scale Transformer Decoding

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

End-to-End 3D Dense Captioning with Vote2Cap-DETR

Multi-scale features with temporal information guidance for video captioning

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

EVC-MF: End-to-end Video Captioning Network with Multi-scale Features

Delving Deeper into the Decoder for Video Captioning

PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

CDKM: Common and Distinct Knowledge Mining Network with Content Interaction for Dense Captioning

Context and Attribute Grounded Dense Captioning

Dual-level Collaborative Transformer for Image Captioning

Region-Focused Network for Dense Captioning

Show, tell and rectify: Boost image caption generation via an output rectifier

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Divided Caption Model with Global Attention

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Exploring Discrete Diffusion Models for Image Captioning

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

Progressive Scale-aware Network for Remote sensing Image Change Captioning