Abstract: CNN-LSTM based architectures have played an important role in image captioning, but because of their limited training efficiency and expressive ability, researchers began to explore CNN-Transformer based models and achieved great success. Meanwhile, almost all recent works adopt Faster R-CNN as the backbone encoder to extract region-level features from given images. However, Faster R-CNN requires pre-training on an additional dataset, which divides the image captioning task into two stages and limits its potential applications. In this paper, we build a pure Transformer-based model, which integrates image captioning into one stage and enables end-to-end training. First, we adopt Swin Transformer to replace Faster R-CNN as the backbone encoder to extract grid-level features from given images. Then, following the Transformer design, we build a refining encoder and a decoder. The refining encoder refines the grid features by capturing the relationships among them, and the decoder decodes the refined features into captions word by word. Furthermore, to increase the interaction between multi-modal (vision and language) features and thereby enhance the modeling capability, we compute the mean pooling of the grid features as a global feature, feed it into the refining encoder to be refined jointly with the grid features, and add a pre-fusion step in the decoder that combines the refined global feature with the generated words. To validate the effectiveness of the proposed model, we conduct experiments on the MSCOCO dataset. Compared with existing published works, the experimental results show that our model achieves new state-of-the-art CIDEr scores of 138.2% (single model) and 141.0% (ensemble of 4 models) on the 'Karpathy' offline test split, and 136.0% (c5) and 138.3% (c40) on the official online test server. Trained models and source code will be released.
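
The global-feature step described in the abstract can be illustrated with a minimal sketch. It assumes the grid features are already extracted as a list of equal-length vectors, and it stands in for the refining encoder with a single-head scaled dot-product self-attention that uses identity projections (no learned weights, no feed-forward layer, no normalization). The function names `mean_pool`, `self_attention`, and `refine` are hypothetical and are not taken from the released code.

```python
import math


def mean_pool(grid):
    """Average N grid feature vectors (each of length D) into one global vector."""
    n = len(grid)
    d = len(grid[0])
    return [sum(vec[j] for vec in grid) / n for j in range(d)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which a real encoder layer would replace
    with learned linear maps.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(weights[i] * tokens[i][j] for i in range(len(tokens))) for j in range(d)]
        )
    return out


def refine(grid):
    """Prepend the mean-pooled global feature and refine all tokens jointly."""
    global_feat = mean_pool(grid)
    refined = self_attention([global_feat] + grid)
    return refined[0], refined[1:]


grid_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined_global, refined_grid = refine(grid_features)
```

Because the global token attends to every grid token (and vice versa), the refined global feature summarizes the whole image and can then be passed to the decoder alongside the refined grid features.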

End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Dual-Stream Transformer for Generic Event Boundary Captioning

Dual-Stream Recurrent Neural Network for Video Captioning

EVC-MF: End-to-end Video Captioning Network with Multi-scale Features

DCMSTRD: End-to-end Dense Captioning via Multi-Scale Transformer Decoding

Bidirectional transformer with knowledge graph for video captioning

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Video captioning based on vision transformer and reinforcement learning

Multi-scale features with temporal information guidance for video captioning

End-to-End Transformer Based Model for Image Captioning

Interaction augmented transformer with decoupled decoding for video captioning

Dual visual align-cross attention-based image captioning transformer

A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning

Dual-level Collaborative Transformer for Image Captioning

Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

Exploring refined dual visual features cross-combination for image captioning

End-to-End 3D Dense Captioning with Vote2Cap-DETR

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network