Abstract: Multimodal Artificial Intelligence (Multimodal AI) generally involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become increasingly sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making into a single model, and the boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), which is based on the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy that groups the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on only one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), etc., and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.

What problem does this paper attempt to address?

The paper addresses multimodal data fusion, particularly in modern deep learning applications. With advances in sensor technology and growing data diversity, effectively fusing heterogeneous data (such as images, text, and readings from different sensors) has become a key problem. The paper proposes a new fine-grained taxonomy that divides state-of-the-art multimodal data fusion methods into five categories: encoder-decoder methods, attention mechanism methods, graph neural network methods, generative neural network methods, and other constraint-based methods. The key contributions of the paper include:

1. **Proposing a new fine-grained taxonomy**: Unlike the traditional division into early fusion, intermediate fusion, late fusion, and hybrid fusion, the paper classifies methods by the deep learning techniques they use, which better reflects current research trends.
2. **Extensive coverage of multimodal combinations and tasks**: Compared to previous surveys, this one covers a wider range of modality combinations (such as vision + language and vision + other sensors) and corresponding tasks (such as multimodal object segmentation, multimodal sentiment analysis, visual question answering, and video captioning).
3. **Exploration of the latest development trends**: The paper also explores new trends in multimodal data fusion and provides a comparative analysis of state-of-the-art models. For example, large pre-trained models (such as Transformer-based pre-trained models) are included in the discussion, while some outdated methods (such as deep belief networks) are excluded.

In summary, the paper aims to give researchers a comprehensive framework for understanding multimodal data fusion by proposing a new taxonomy and covering a wide range of modality combinations and tasks. By focusing on recent technical advances, it also offers guidance on future research directions.

Deep Multimodal Data Fusion
