Abstract: Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features using an attention mechanism. Despite the success of existing studies, current methods model only the co-attention that characterizes inter-modal interactions and neglect the self-attention that characterizes intra-modal interactions. Inspired by the success of the Transformer model in machine translation, we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. By stacking such attention blocks in depth, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach on the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranks first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
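
To make the distinction between intra-modal self-attention and inter-modal co-attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes standard scaled dot-product attention as used in the Transformer, and it omits learned projections, multiple heads, residual connections, layer normalization, and the causal mask a real caption decoder would need. The names `attention`, `regions`, `words`, and the sizes used are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values, weights


rng = np.random.default_rng(0)
d_model = 8                                  # assumed feature size
regions = rng.normal(size=(5, d_model))      # 5 hypothetical image-region features
words = rng.normal(size=(3, d_model))        # 3 hypothetical caption-word embeddings

# Intra-modal interaction: image regions attend to each other (self-attention).
regions_ctx, w_self = attention(regions, regions, regions)

# Inter-modal interaction: caption words attend to the image regions (co-attention).
words_ctx, w_cross = attention(words, regions_ctx, regions_ctx)

print(regions_ctx.shape, words_ctx.shape, w_cross.shape)
```

In this sketch the two interaction types differ only in where the queries, keys, and values come from: the same modality for self-attention, different modalities for co-attention. This is what lets both be expressed with one attention primitive.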

CLIP-enhanced multimodal machine translation: integrating visual and label features with transformer fusion

CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

Contrastive Learning Based Visual Representation Enhancement for Multimodal Machine Translation

Multimodal Pretraining from Monolingual to Multilingual

Multimodal Transformer For Multimodal Machine Translation

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Multimodal Multilabel Classification by CLIP

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Image–Text Matching Model Based on CLIP Bimodal Encoding

Multi-Modal Adapter for Vision-Language Models

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

Multilingual Multimodal Learning with Machine Translated Text

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

DAS-CL: Towards Multimodal Machine Translation Via Dual-Level Asymmetric Contrastive Learning

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Multi-grained visual pivot-guided multi-modal neural machine translation with text-aware cross-modal contrastive disentangling

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis