Abstract: Obtaining an effective joint representation has always been the goal of multimodal tasks. However, a distributional gap inevitably exists because of the heterogeneous nature of different modalities, which burdens both the fusion process and the learning of multimodal representations. The imbalance of modality dominance further aggravates this problem, as inferior modalities may contain substantial redundancy that introduces additional variation. To address these issues, we propose a Disentanglement Translation Network (DTN) with Slack Reconstruction to capture desirable information properties, obtain a unified feature distribution, and reduce redundancy. Specifically, an encoder–decoder-based disentanglement framework is adopted to decouple the unimodal representations into modality-common and modality-specific subspaces, which capture cross-modal commonality and diversity, respectively. In the encoding stage, to narrow the discrepancy, a two-stage translation is devised and integrated into the disentanglement learning framework. The first stage aims to learn a modality-invariant embedding of the modality-common information with an adversarial learning strategy, capturing the commonality shared across modalities. The second stage considers the modality-specific information that reveals diversity. To relieve the burden of multimodal fusion, we apply Specific-Common Distribution Matching to further unify the distribution of the desirable information. In the decoding and reconstruction stage, we propose Slack Reconstruction to balance retaining discriminative information against reducing redundancy. Although the commonly used reconstruction loss with a strict constraint lowers the risk of information loss, it easily leads to the preservation of redundant information. In contrast, Slack Reconstruction imposes a more relaxed constraint so that redundancy is not forced to be retained, and simultaneously explores inter-sample relationships. The proposed method aids multimodal fusion by learning the exact properties and obtaining a more uniform distribution for cross-modal data, and reduces information redundancy to further ensure feature effectiveness. Extensive experiments on multimodal sentiment analysis indicate the effectiveness of the proposed method. The code is available at https://github.com/zengy268/DTN.
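
The contrast between a strict and a relaxed reconstruction constraint can be illustrated with a small sketch. This is not the paper's Slack Reconstruction loss, whose exact form the abstract does not give. It assumes a simple tolerance-band (epsilon-insensitive) variant of squared error, and the function names and the `margin` parameter are hypothetical. The inter-sample relationships mentioned in the abstract are not modeled here.

```python
from typing import Sequence


def strict_reconstruction_loss(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error: every deviation from the input is penalized."""
    if len(x) != len(x_hat) or len(x) == 0:
        raise ValueError("x and x_hat must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def relaxed_reconstruction_loss(
    x: Sequence[float], x_hat: Sequence[float], margin: float = 0.05
) -> float:
    """Hypothetical relaxed variant (an assumption, not the paper's loss).

    Squared errors at or below `margin` cost nothing, so the decoder is not
    pushed to reproduce every small detail of the input.
    """
    if len(x) != len(x_hat) or len(x) == 0:
        raise ValueError("x and x_hat must be non-empty and of equal length")
    return sum(max(0.0, (a - b) ** 2 - margin) for a, b in zip(x, x_hat)) / len(x)


# Toy feature vector and an imperfect reconstruction of it.
features = [1.0, 0.0, 2.0]
reconstruction = [1.1, 0.1, 1.0]

strict = strict_reconstruction_loss(features, reconstruction)
relaxed = relaxed_reconstruction_loss(features, reconstruction, margin=0.05)
print(f"strict={strict:.4f} relaxed={relaxed:.4f}")
```

In this toy case the two small deviations fall inside the tolerance band and contribute nothing to the relaxed loss, while the large deviation is still penalized. The relaxed value is therefore never larger than the strict one.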

Disentanglement Translation Network for multimodal sentiment analysis
