Abstract: Obtaining an effective joint representation has long been a central goal of multimodal tasks. However, a distributional gap inevitably exists because of the heterogeneous nature of different modalities, which burdens both the fusion process and the learning of multimodal representations. The imbalance of modality dominance further aggravates this problem, since inferior modalities may contain substantial redundancy that introduces additional variation. To address these issues, we propose a Disentanglement Translation Network (DTN) with Slack Reconstruction to capture the desired information properties, obtain a unified feature distribution, and reduce redundancy. Specifically, an encoder-decoder-based disentanglement framework decouples the unimodal representations into modality-common and modality-specific subspaces, which capture cross-modal commonality and diversity, respectively. In the encoding stage, a two-stage translation is integrated into the disentanglement learning framework to narrow the discrepancy. The first stage aims to learn a modality-invariant embedding of the modality-common information with an adversarial learning strategy, capturing the commonality shared across modalities. The second stage considers the modality-specific information that reveals diversity. To relieve the burden of multimodal fusion, we introduce Specific-Common Distribution Matching to further unify the distribution of the desired information. For the decoding and reconstruction stage, we propose Slack Reconstruction to balance retaining discriminative information against reducing redundancy. Although the commonly used reconstruction loss with a strict constraint lowers the risk of information loss, it easily leads to redundant information being preserved. In contrast, Slack Reconstruction imposes a more relaxed constraint so that redundancy is not forced to be retained, while simultaneously exploring inter-sample relationships. The proposed method aids multimodal fusion by learning the exact properties and obtaining a more uniform distribution for cross-modal data, and it reduces information redundancy to further ensure feature effectiveness. Extensive experiments on multimodal sentiment analysis indicate the effectiveness of the proposed method. The code is available at https://github.com/zengy268/DTN.
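To make the contrast between a strict and a relaxed reconstruction constraint concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the form of Slack Reconstruction, so the triplet-style hinge below (a reconstruction only needs to be closer to its own input than to another sample in the batch, by a margin) is an assumption chosen because it is both relaxed and uses inter-sample relationships. The function names, the linear encoders, the `margin` value, and all dimensions are hypothetical.

```python
import numpy as np


def strict_reconstruction_loss(x, x_hat):
    """Mean squared error: every feature of x must be reproduced."""
    return float(np.mean((x - x_hat) ** 2))


def slack_reconstruction_loss(x, x_hat, margin=1.0):
    """Hypothetical relaxed loss (triplet-style hinge, an assumption).

    A reconstruction is penalized only if it is not closer to its own
    input than to a different sample in the batch by at least `margin`.
    """
    # Squared distance from each reconstruction to its own input.
    pos = np.sum((x_hat - x) ** 2, axis=1)
    # Squared distance to another sample (the batch shifted by one row).
    neg = np.sum((x_hat - np.roll(x, 1, axis=0)) ** 2, axis=1)
    return float(np.mean(np.maximum(pos - neg + margin, 0.0)))


def disentangle(x, w_common, w_specific):
    """Project features into common and specific subspaces (linear stand-ins)."""
    return x @ w_common, x @ w_specific


rng = np.random.default_rng(0)
batch, dim, hidden = 8, 16, 4  # illustrative sizes only

x = rng.normal(size=(batch, dim))  # stand-in for one modality's features
w_common = rng.normal(scale=0.1, size=(dim, hidden))
w_specific = rng.normal(scale=0.1, size=(dim, hidden))
w_decoder = rng.normal(scale=0.1, size=(2 * hidden, dim))

# Encode into the two subspaces, then decode from their concatenation.
common, specific = disentangle(x, w_common, w_specific)
x_hat = np.concatenate([common, specific], axis=1) @ w_decoder

# An imperfect but close reconstruction: the strict loss still penalizes
# it, while the relaxed loss is already satisfied.
x_noisy = x + 0.1 * rng.normal(size=x.shape)
strict_noisy = strict_reconstruction_loss(x, x_noisy)
slack_noisy = slack_reconstruction_loss(x, x_noisy)
print(strict_noisy > 0.0, slack_noisy == 0.0)
```

Under this assumed formulation, a decoder is not pushed to reproduce every feature of the input exactly, which is the sense in which redundancy is "not forced to be retained"; the adversarial and distribution-matching components described above are omitted from the sketch.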

Inter-Intra Modal Representation Augmentation with Trimodal Collaborative Disentanglement Network for Multimodal Sentiment Analysis

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Hierarchical denoising representation disentanglement and dual-channel cross-modal-context interaction for multimodal sentiment analysis

TMMDA: A New Token Mixup Multimodal Data Augmentation for Multimodal Sentiment Analysis

Cross-modal Enhancement Network for Multimodal Sentiment Analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Tri-Modalities Fusion for Multimodal Sentiment Analysis

Disentanglement Translation Network for multimodal sentiment analysis

MInD: Improving Multimodal Sentiment Analysis via Multimodal Information Disentanglement

Adaptive Modality Distillation for Separable Multimodal Sentiment Analysis

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

Mutual information maximization and feature space separation and bi-bimodal modality fusion for multimodal sentiment analysis

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network

Token-disentangling Mutual Transformer for multimodal emotion recognition

Triple Disentangled Representation Learning for Multimodal Affective Analysis

TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis

FDR-MSA: Enhancing multimodal sentiment analysis through feature disentanglement and reconstruction

Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Missing Modality Reconstruction Network Based on Shared-Specific Features

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments