Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Jun Yu, Jing Li, Zhou Yu, Qingming Huang

DOI: https://doi.org/10.1109/tcsvt.2019.2947482

IF: 5.859

2020-12-01

IEEE Transactions on Circuits and Systems for Video Technology

Abstract: Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features from the input image, and a recurrent neural network (RNN)-based caption decoder that generates the output caption words from the visual features using an attention mechanism. Despite the success of existing studies, current methods only model the co-attention that characterizes the inter-modal interactions while neglecting the self-attention that characterizes the intra-modal interactions. Inspired by the success of the Transformer model in machine translation, here we extend it to a Multimodal Transformer (MT) model for image captioning. Compared to existing image captioning approaches, the MT model simultaneously captures intra- and inter-modal interactions in a unified attention block. Due to the in-depth modular composition of such attention blocks, the MT model can perform complex multimodal reasoning and output accurate captions. Moreover, to further improve the image captioning performance, multi-view visual features are seamlessly introduced into the MT model. We quantitatively and qualitatively evaluate our approach using the benchmark MSCOCO image captioning dataset and conduct extensive ablation studies to investigate the reasons behind its effectiveness. The experimental results show that our method significantly outperforms the previous state-of-the-art methods. With an ensemble of seven models, our solution ranked first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.

engineering, electrical & electronic

What problem does this paper attempt to address?

The paper addresses three limitations of existing image captioning methods. First, they model cross-modal attention (the interaction between vision and text) but ignore self-attention within each modality (word-to-word and object-to-object). Second, existing models are usually relatively shallow and may not fully capture the complex relationships between visual objects. Third, region-based visual features may not cover all objects in the image, which leads to inaccurate captions. To overcome these problems, the paper proposes a Multimodal Transformer (MT) model that simultaneously captures self-attention within each modality and co-attention between modalities. Stacking these attention blocks in depth yields a deep encoder-decoder model, which improves captioning accuracy. To further improve performance, the paper introduces multi-view visual feature learning to provide more diverse and discriminative visual representations. Experimental results show that the method significantly outperforms previous state-of-the-art methods and ranked first on the real-time leaderboard of the MSCOCO image captioning challenge at the time of writing.
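The difference between self-attention and co-attention can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes plain scaled dot-product attention with no learned projections, no multiple heads, no masking, no layer normalization, and no feed-forward sublayer, and the function names (`attention`, `self_attention`, `co_attention`, `decoder_block`) are hypothetical. The only point it makes is that both kinds of attention use the same operation and differ in where the queries, keys, and values come from.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


def self_attention(features):
    """Intra-modal: queries, keys, and values all come from one modality."""
    return attention(features, features, features)


def co_attention(words, regions):
    """Inter-modal: word features query the image region features."""
    return attention(words, regions, regions)


def residual(inputs, updates):
    """Element-wise residual connection."""
    return [[a + b for a, b in zip(x, u)] for x, u in zip(inputs, updates)]


def decoder_block(words, regions):
    """One simplified block: word self-attention, then word-to-region co-attention."""
    words = residual(words, self_attention(words))
    words = residual(words, co_attention(words, regions))
    return words


# Toy inputs (assumed values): three image regions, two caption words.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]

encoded_regions = self_attention(regions)            # object-to-object
attended_words = decoder_block(words, encoded_regions)
print(len(attended_words), len(attended_words[0]))
```

In this sketch, stacking more calls to `self_attention` on the region features and more calls to `decoder_block` on the word features would correspond to the deeper encoder-decoder composition the summary describes.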

Mmt: A Multimodal Translator For Image Captioning

End-to-End Transformer Based Model for Image Captioning

Multi-Keys Attention Network for Image Captioning

Entangled Transformer for Image Captioning

Context-Aware Transformer for image captioning

MAT: A Multimodal Attentive Translator for Image Captioning

Enhanced Modality Transition for Image Captioning

Multimodal Transformer For Multimodal Machine Translation

BENet: bi-directional enhanced network for image captioning

Multi-Modal Graph Aggregation Transformer for image captioning

Transformer with multi-level grid features and depth pooling for image captioning

Visual contextual relationship augmented transformer for image captioning

Dual visual align-cross attention-based image captioning transformer

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Transformer with token attention and attribute prediction for image captioning

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning

Controllable image caption with an encoder-decoder optimization structure

Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network

Dual-level Collaborative Transformer for Image Captioning

A Multi-task Learning Approach for Image Captioning