Abstract: Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embeddings, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

Multimodal Representation Learning With Text and Images

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Unsupervised Multimodal Language Representations using Convolutional Autoencoders

Task-agnostic representation learning of multimodal twitter data for downstream applications

Multimodal Representation Learning via Maximization of Local Mutual Information

Multimodal Learning of Social Image Representation by Exploiting Social Relations

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Universal Multimodal Representation for Language Understanding

Multimodal Deep Representation Learning for Video Classification

Multimodal Representation Learning by Alternating Unimodal Adaptation

Attribution Regularization for Multimodal Paradigms

Semi-supervised Multimodal Representation Learning through a Global Workspace

Brain-inspired Multimodal Learning Based on Neural Networks

Multimodal sparse representation learning and applications

Multi-Modal Representation Learning with Text-Driven Soft Masks

Using Multiple Instance Learning to Build Multimodal Representations

Learning deep representation of multityped objects and tasks

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Training Multimodal Systems for Classification with Multiple Objectives

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

Advanced Multimodal Deep Learning Architecture for Image-Text Matching