Abstract: Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become more and more sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into one single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on the stage at which fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on only one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), etc., and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.
What problem does this paper attempt to address?
The paper addresses multimodal data fusion in modern deep learning. As sensor technology advances and data become more diverse, effectively fusing heterogeneous data (images, text, and readings from other sensors) has become a key problem. Because modern multimodal networks combine feature extraction, fusion, and decision-making in a single model, the conventional taxonomy based on the stage at which fusion occurs no longer fits. The paper therefore proposes a new fine-grained taxonomy that divides state-of-the-art multimodal data fusion methods into five categories: encoder-decoder methods, attention mechanism methods, graph neural network methods, generative neural network methods, and other constraint-based methods.
The key contributions of the paper include:
1. **Proposing a new fine-grained taxonomy**: Instead of the traditional division into early, intermediate, late, and hybrid fusion, the paper groups methods by the main deep learning technique they use, which better reflects current research trends.
2. **Extensive coverage of multimodal combinations and tasks**: Compared to previous surveys, this review covers a wider range of modality combinations (such as vision + language and vision + other sensors) and corresponding tasks (such as multimodal object segmentation, multimodal sentiment analysis, visual question answering, and video captioning).
3. **Exploration of the latest development trends**: The paper also explores new trends in multimodal data fusion and provides a comparative analysis of state-of-the-art models. For example, large Transformer-based pre-trained models are included in the discussion, while some outdated methods (such as deep belief networks) are excluded.
In summary, the paper aims to give researchers a comprehensive, in-depth framework for understanding multimodal data fusion by proposing a new taxonomy and covering a wide range of modality combinations and tasks. By focusing on recent advances, it also points to challenges and directions for future research.
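For readers unfamiliar with the conventional taxonomy that the paper argues is outdated, the following is a minimal sketch of the difference between early (feature-level) fusion and late (decision-level) fusion with a majority vote. It is not taken from the paper: the function names, the toy feature values, and the `threshold_classifier` stand-in for a trained model are all assumptions made for illustration.

```python
from collections import Counter


def early_fusion(*feature_vectors):
    """Early (feature-level) fusion: concatenate per-modality features."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def threshold_classifier(features, threshold=0.5):
    """Hypothetical stand-in for a trained model (assumption, not from the paper):
    returns 1 if the mean feature value exceeds the threshold, else 0."""
    return 1 if sum(features) / len(features) > threshold else 0


def late_fusion(labels):
    """Late (decision-level) fusion: majority vote over per-modality labels.
    On a tie, Counter.most_common returns the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Toy feature vectors for three modalities (made-up values).
image_feats = [0.9, 0.8]
text_feats = [0.2, 0.1, 0.3]
sensor_feats = [0.6, 0.55]

# Early fusion: one classifier sees the concatenated features.
early_label = threshold_classifier(
    early_fusion(image_feats, text_feats, sensor_feats)
)

# Late fusion: one classifier per modality, then a vote on the decisions.
late_label = late_fusion(
    [threshold_classifier(f) for f in (image_feats, text_feats, sensor_feats)]
)

print("early fusion label:", early_label)
print("late fusion label:", late_label)
```

In both cases the fusion step is a distinct stage that can be pointed to. In an end-to-end multimodal network, extraction, fusion, and decision-making are learned jointly, which is why the paper groups methods by technique rather than by fusion stage.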