Abstract:Image captioning is a challenging problem. Different from other computer vision tasks such as image classification and object detection, image captioning requires not only understanding the image, but also the knowledge of natural language. In this work, we formulate the problem of image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Different from most existing work where the whole image is represented by a convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects to serve as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated into a sequence of words, as the target sequence of the RNN model. To obtain the source sequence from the image, objects are first detected by pre-trained detectors and then converted to a sequential representation using heuristic ordering strategies, that is, by the saliency scores of the detected objects. We propose three ordering methods, descending, ascending and random, according to the saliency scores, in order to study the influence of ordering over RNN cells. To obtain the target sequence, the language words are represented as one-hot feature vector. The representations of the objects and the words are then mapped into a common hidden space. The translation from the source sequence to the target sequence is done by leveraging LSTM. Extensive experiments are conducted to evaluate the proposed approach on benchmark dataset, i.e., MSCOCO, and achieve the state-of-the-art performance. The proposed approach is also evaluated by the evaluation server of MS COCO captioning challenge and achieves very competitive results. For example, we achieve CIDEr of 93.2, RougeL of 53.2 and BLEU4 of 31.1. We validate the contribution of each idea, that is, sequential representation and ordering method, by comparison studies, and show that sequential representation indeed improves the performance compared to vanilla CNN + RNN based methods, and ascending ordering outperforms the other two ordering methods.

Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension

Multi-Matching Network for Multiple Choice Reading Comprehension

MCEN: Bridging Cross-Modal Gap between Cooking Recipes and Dish Images with Latent Variable Model.

MCEN: Bridging Cross-Modal Gap Between Cooking Recipes and Dish Images with Latent Variable Model

Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

Challenges in Procedural Multimodal Machine Comprehension:A Novel Way To Benchmark

MMoT: Mixture-of-Modality-Tokens Transformer for Composed Multimodal Conditional Image Synthesis

Multimodal Instruction Tuning with Hybrid State Space Models

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Meta-Transformer: A Unified Framework for Multimodal Learning

Multilevel Transformer For Multimodal Emotion Recognition

Multimodal Transformer For Multimodal Machine Translation

M3: A Multi-View Fusion and Multi-Decoding Network for Multi-Document Reading Comprehension.

TransModality: An End2End Fusion Method with Transformer for Multimodal Sentiment Analysis

Mmt: A Multimodal Translator For Image Captioning

Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

Multimodal Analysis for Deep Video Understanding with Video Language Transformer

MulT: An End-to-End Multitask Learning Transformer

Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion