Abstract: Inspired by the McGurk effect, research on multimodal data fusion began with audio-visual speech recognition tasks. The field then lost popularity for a period because the capacity of traditional machine learning was limited. Recently, advances in deep learning techniques have provided new opportunities for multimodal data fusion. Powerful deep learning models can process high-dimensional and complex multimodal data, and multimodal deep learning has the potential to process multimodal data at the human level. However, theoretical analytical methods that relate the information in the data to model performance are still lacking. In this work, we propose basic concepts and principles to gain insight into the process of multimodal data fusion from an information theory perspective. We analyze different multimodal data fusion cases, such as redundant, noisy, consistent, and contradictory data fusion. We define the model accuracy upper bound for multimodal tasks and prove that a multimodal model with an extra modal channel can perform better in theory when the extra modal data provide more effective information for prediction. For the first time, we explicitly inspect the latent representation space and analyze the information loss of the representation space transformation in deep learning. Moving from a naive example to a multimodal deep learning example, we demonstrate the theoretical analysis method for evaluating a multimodal data fusion model, and the experimental results validate the definitions and principles. (c) 2022 Elsevier Inc. All rights reserved.
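The claim that an extra modal channel can help only when it carries additional effective information can be illustrated with a small discrete toy problem. The sketch below is not the paper's method or notation: the binary label, the two noisy "modalities", the flip probabilities, and all function names are assumptions made for illustration, and the Bayes-optimal accuracy is used here as a stand-in for an accuracy upper bound. It relies on the standard chain rule I(X1, X2; Y) = I(X1; Y) + I(X2; Y | X1), which implies that fusing a second modality never reduces the mutual information with the label, and that a purely redundant copy adds nothing.

```python
from collections import defaultdict
from itertools import product
from math import log2


def joint_distribution(flip1, flip2, redundant=False):
    """Build p((x1, x2), y) for a uniform binary label y (hypothetical toy task).

    x1 is y flipped with probability flip1. x2 is y flipped with probability
    flip2, independently of x1 given y, unless redundant=True, in which case
    x2 is an exact copy of x1.
    """
    joint = defaultdict(float)
    for y, e1, e2 in product((0, 1), repeat=3):
        p = 0.5
        p *= flip1 if e1 else 1 - flip1
        p *= flip2 if e2 else 1 - flip2
        x1 = y ^ e1
        x2 = x1 if redundant else y ^ e2
        joint[((x1, x2), y)] += p
    return dict(joint)


def marginalize(joint, keep):
    """Keep only the modality indices listed in `keep`, summing out the rest."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(tuple(x[i] for i in keep), y)] += p
    return dict(out)


def mutual_information(joint):
    """Mutual information I(X; Y) in bits for a joint table keyed by (x, y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )


def bayes_accuracy(joint):
    """Accuracy of the Bayes-optimal classifier: sum over x of max_y p(x, y)."""
    best = defaultdict(float)
    for (x, y), p in joint.items():
        best[x] = max(best[x], p)
    return sum(best.values())


if __name__ == "__main__":
    fused = joint_distribution(flip1=0.2, flip2=0.1)
    only_x1 = marginalize(fused, keep=(0,))
    copied = joint_distribution(flip1=0.2, flip2=0.1, redundant=True)

    for name, table in [("x1 only", only_x1), ("x1 + x2", fused),
                        ("x1 + copy of x1", copied)]:
        print(f"{name:16s} I(X;Y) = {mutual_information(table):.4f} bits, "
              f"Bayes accuracy = {bayes_accuracy(table):.4f}")
```

In this toy setting the informative second modality raises both the mutual information and the best achievable accuracy, while the redundant copy leaves both unchanged, which matches the qualitative distinction the abstract draws between redundant and informative fusion.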

On the Benefits of Early Fusion in Multimodal Representation Learning

Progressive Fusion for Multimodal Integration

CMCI: A Robust Multimodal Fusion Method for Spiking Neural Networks

Learning Joint Multimodal Representation Based On Multi-Fusion Deep Neural Networks

Attention Bottlenecks for Multimodal Fusion

Dense Multimodal Fusion for Hierarchically Joint Representation

Multimodal Language Analysis with Recurrent Multistage Fusion

MMTM: Multimodal Transfer Module for CNN Fusion

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling

Multimodal Contrastive Learning for Brain-Machine Fusion: from Brain-in-the-loop Modeling to Brain-out-of-the-loop Application

Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion

Analysis of Multimodal Data Fusion from an Information Theory Perspective

Memory based fusion for multi-modal deep learning

Neural Dependency Coding inspired Multimodal Fusion

Brain-inspired Multimodal Learning Based on Neural Networks

CentralNet: a Multilayer Approach for Multimodal Fusion

Locally Confined Modality Fusion Network with a Global Perspective for Multimodal Human Affective Computing

Deep Multimodal Learning for Audio-Visual Speech Recognition

Multimodal Transformer Fusion for Continuous Emotion Recognition

MSAF: Multimodal Split Attention Fusion

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation