Abstract: In this paper, we study the task of multimodal sequence analysis, which aims to draw inferences from visual, language and acoustic sequences. Most existing works focus on aligned fusion of the three modalities, mostly at the word level, which is impractical in real-world scenarios. To overcome this issue, we address multimodal sequence analysis on unaligned modality sequences, a setting that is still relatively underexplored and also more challenging. Recurrent neural networks (RNNs) and their variants are widely used in multimodal sequence analysis, but they are susceptible to vanishing and exploding gradients and to high time complexity due to their recurrent nature. Therefore, we propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNNs) in modeling multimodal sequential data. The graph-based structure enables parallel computation along the time dimension and can learn longer temporal dependencies in long unaligned sequences. Specifically, our Multimodal Graph is hierarchically structured to cater to two stages, i.e., intra- and inter-modal dynamics learning. In the first stage, a graph convolutional network is employed for each modality to learn intra-modal dynamics. In the second stage, given that the multimodal sequences are unaligned, the commonly used word-level fusion does not apply. To this end, we devise a graph pooling fusion network to automatically learn the associations between nodes from different modalities. Additionally, we define multiple ways to construct the adjacency matrix for sequential data. Experimental results suggest that our graph-based model reaches state-of-the-art performance on two benchmark datasets.
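The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the fixed-window adjacency (the abstract only says that several constructions are defined), the standard symmetric-normalized graph convolution, the softmax cluster assignment used as a stand-in for the graph pooling fusion network, the random untrained weights, and the sequence lengths and feature sizes. All function names are hypothetical.

```python
import numpy as np


def window_adjacency(num_steps, window=2):
    """Assumed construction: link each time step to steps within `window`.

    The diagonal is included, so every node has a self-loop.
    """
    idx = np.arange(num_steps)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(float)


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def gcn_layer(x, adj_norm, weight):
    """One graph convolution, ReLU(A_norm X W).

    Every time step is updated in a single matrix product, with no recurrence.
    """
    return np.maximum(adj_norm @ x @ weight, 0.0)


def soft_pool(x, assign_weight):
    """Pool N nodes into K clusters with a softmax assignment (illustrative)."""
    logits = x @ assign_weight
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)
    return assign.T @ x  # shape (K, hidden)


def fuse_unaligned(sequences, hidden=8, clusters=4, seed=0):
    """Stage 1: per-modality GCN. Stage 2: pool all nodes jointly."""
    rng = np.random.default_rng(seed)
    node_sets = []
    for seq in sequences:
        adj_norm = normalize_adjacency(window_adjacency(len(seq)))
        # Random, untrained weights; a real model would learn these.
        weight = rng.normal(scale=0.1, size=(seq.shape[1], hidden))
        node_sets.append(gcn_layer(seq, adj_norm, weight))
    # Nodes from all modalities share one pool, so no word-level
    # alignment between the sequences is required.
    nodes = np.concatenate(node_sets, axis=0)
    assign_weight = rng.normal(scale=0.1, size=(hidden, clusters))
    pooled = soft_pool(nodes, assign_weight)
    return pooled.mean(axis=0)


# Three unaligned sequences with arbitrary lengths and feature sizes.
rng = np.random.default_rng(1)
language = rng.normal(size=(20, 300))
acoustic = rng.normal(size=(50, 74))
visual = rng.normal(size=(35, 35))
fused = fuse_unaligned([language, acoustic, visual])
print(fused.shape)  # (8,)
```

Because the pooling step operates on the concatenated node set, the three sequences can have different lengths and sampling rates. The fused vector has a fixed size regardless of how many time steps each modality contributes.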
