Abstract: Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

Multimodal information fusion for selected multimedia applications

Emotion Recognition in Videos via Fusing Multimodal Features

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multimodal information fusion for human-robot interaction

A snapshot research and implementation of multimodal information fusion for data-driven emotion recognition

Multimodal Fusion for Robotics

Optimal Multimodal Fusion for Multimedia Data Analysis

Multimodal Emotion Recognition Based on Feature Fusion

Data fusion methods in multimodal human computer dialog

Intelligence Methods of Multi-Modal Information Fusion in Human-Computer Interaction

Deep Multimodal Data Fusion

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Weakly Paired Multimodal Fusion for Object Recognition

Multimodal Sensors and ML‐Based Data Fusion for Advanced Robots

Mutual information maximization and feature space separation and bi-bimodal modality fusion for multimodal sentiment analysis

Multimodal Affective Computing Based on Weighted Linear Fusion

Decision Making of Mobile Robot based on Multimodal Fusion

An Effective Multimodal Representation and Fusion Method for Multimodal Intent Recognition

Multimodal Emotion Recognition Based on Facial Expressions, Speech, and Body Gestures

Multimodal Emotion Recognition Using Different Fusion Techniques

The Labeled Multiple Canonical Correlation Analysis for Information Fusion