Abstract: Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embeddings, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

Introduction to the Special Issue on Deep Learning for Multi-Modal Intelligence Across Speech, Language, Vision, and Heterogeneous Signals

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Recent Advances and Trends in Multimodal Deep Learning: A Review

Vision+X: A Survey on Multimodal Learning in the Light of Data

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Big Cross-Modal Social Media Data Analytics with Deep Intelligence

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

Learn to Combine Modalities in Multimodal Deep Learning

Special issue on deep learning-based neural information processing for big data analytics

Deep Multimodal Data Fusion

A Review on Methods and Applications in Multimodal Deep Learning

Editorial: Introduction to the Special Issue on Deep Learning for High-Dimensional Sensing

A survey of multi-modal learning theory

Guest Editorial Introduction to the Special Section on Video and Language

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey

A scoping review on multimodal deep learning in biomedical images and texts

More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification