Abstract: Multimodal sentiment classification is an active research field that aims to extract sentiment information and classify sentiment tendency from sequential multimodal data. Most existing sentiment recognition algorithms rely on multimodal fusion schemes and achieve good performance, but two key challenges remain. First, inter- and intra-modality features must be extracted effectively before fusion while ambiguity is reduced. Second, the model must learn modality-invariant representations that capture the underlying similarities between modalities. In this paper, we present a modality-invariant temporal learning technique and a new gated inter-modality attention mechanism to address these challenges. For the first challenge, the proposed gated inter-modality attention mechanism performs modality interactions and adaptively filters inconsistencies across modalities. We also use parallel structures to learn more comprehensive sentiment information from modality pairs (i.e., acoustic and visual). For the second challenge, we treat each modality as a multivariate Gaussian distribution (considering each timestamp as a single Gaussian distribution) and use the KL divergence to capture implicit temporal distribution-level similarities. These strategies help reduce domain shifts between modalities and extract effective sequential modality-invariant representations. Experiments on several public datasets (i.e., YouTube and MOUD) show that the proposed method outperforms state-of-the-art multimodal sentiment classification methods.
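
The distribution-level comparison described in the abstract can be illustrated with a small sketch. It assumes that each timestamp of a modality is summarized by a diagonal Gaussian (a mean vector and a variance vector), that the two sequences are already aligned to the same length, and that per-timestamp divergences are simply averaged. None of these choices is stated in the abstract, and the function names `gaussian_kl_diag` and `temporal_kl` are hypothetical, so this shows the general idea rather than the paper's actual loss.

```python
import math


def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two diagonal Gaussians, given per-dimension means and variances."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def temporal_kl(seq_p, seq_q):
    """Mean per-timestamp KL between two aligned sequences of (mean, variance) pairs.

    Averaging over time is an assumption made for illustration only.
    """
    if len(seq_p) != len(seq_q):
        raise ValueError("sequences must be aligned to the same length")
    total = sum(
        gaussian_kl_diag(mp, vp, mq, vq)
        for (mp, vp), (mq, vq) in zip(seq_p, seq_q)
    )
    return total / len(seq_p)


# Toy example: two timestamps, two feature dimensions per modality.
acoustic = [([0.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [1.0, 1.0])]
visual = [([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [1.0, 1.0])]
divergence = temporal_kl(acoustic, visual)
```

In this toy input the two modalities differ only at the first timestamp, so the averaged divergence is small but nonzero. Because KL divergence is asymmetric, a training objective might instead use a symmetrized form; the abstract does not say which variant is used.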

Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Neural Multimodal Cooperative Learning Toward Micro-Video Understanding

Context-aware focal alignment network for micro-video multi-label classification

Attention-enhanced and trusted multimodal learning for micro-video venue recognition

Dual-domain Aligned Deep Hierarchical Matrix Factorization Method for Micro-video Multi-label Classification

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

Multimodal Deep Representation Learning for Video Classification

MMM: Multi-source Multi-net Micro-video Recommendation with Clustered Hidden Item Representation Learning

MMGA: Multimodal Learning with Graph Alignment

Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations

Multi-label video classification via coupling attentional multiple instance learning with label relation graph

Multimodal Learning of Social Image Representation by Exploiting Social Relations

MARN: Multi-level Attentional Reconstruction Networks for Weakly Supervised Video Temporal Grounding

Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation

Attention-based Multimodal Feature Representation Model for Micro-video Recommendation

Multi-granularity cross-modal representation learning for named entity recognition on social media

Multimodal Semantic Attention Network for Video Captioning

Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval