Abstract: Obtaining an effective joint representation has long been the goal of multimodal tasks. However, a distributional gap inevitably exists because of the heterogeneous nature of different modalities, which burdens both the fusion process and the learning of multimodal representations. The imbalance of modality dominance further aggravates this problem: inferior modalities may contain substantial redundancy that introduces additional variation. To address these issues, we propose a Disentanglement Translation Network (DTN) with Slack Reconstruction to capture desirable information properties, obtain a unified feature distribution, and reduce redundancy. Specifically, an encoder-decoder-based disentanglement framework is adopted to decouple the unimodal representations into modality-common and modality-specific subspaces, which capture cross-modal commonality and diversity, respectively. In the encoding stage, to narrow the discrepancy, a two-stage translation is devised and incorporated into the disentanglement learning framework. The first stage aims to learn a modality-invariant embedding for modality-common information with an adversarial learning strategy, capturing the commonality shared across modalities. The second stage considers the modality-specific information that reveals diversity. To relieve the burden of multimodal fusion, we apply Specific-Common Distribution Matching to further unify the distribution of the desirable information. For the decoding and reconstruction stage, we propose Slack Reconstruction to strike a balance between retaining discriminative information and reducing redundancy. Although the commonly used reconstruction loss with a strict constraint lowers the risk of information loss, it easily leads to the preservation of redundant information. In contrast, Slack Reconstruction imposes a more relaxed constraint so that redundancy is not forced to be retained, while simultaneously exploring inter-sample relationships. The proposed method aids multimodal fusion by learning the exact properties and obtaining a more uniform distribution for cross-modal data, and it reduces information redundancy to further ensure feature effectiveness. Extensive experiments on multimodal sentiment analysis indicate the effectiveness of the proposed method. The code is available at https://github.com/zengy268/DTN.
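The contrast between a strict and a relaxed reconstruction constraint can be illustrated with a small sketch. The abstract does not give the formula for Slack Reconstruction, so the margin-based loss below is only an assumed stand-in for a "more relaxed constraint"; the function names and the `margin` parameter are hypothetical, and the inter-sample relationship term mentioned in the abstract is not modeled here.

```python
def strict_reconstruction_loss(x, x_hat):
    """Mean squared error: every deviation from the input is penalized."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def slack_reconstruction_loss(x, x_hat, margin=0.1):
    """Assumed relaxed variant (not the paper's formulation):
    deviations smaller than `margin` incur no penalty, so the decoder
    is not forced to reproduce every detail of the input."""
    return sum(
        max(0.0, abs(a - b) - margin) ** 2 for a, b in zip(x, x_hat)
    ) / len(x)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]        # original unimodal feature (toy values)
    x_hat = [1.05, 2.0, 3.5]   # reconstruction from the decoder (toy values)
    print("strict:", strict_reconstruction_loss(x, x_hat))
    print("slack: ", slack_reconstruction_loss(x, x_hat, margin=0.1))
```

In this toy setting the small deviation on the first feature contributes nothing to the relaxed loss, while the large deviation on the third feature is still penalized. That is the intuition behind not forcing redundancy to be retained, under the stated assumption.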

A Principled Framework for Explainable Multimodal Disentanglement

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations

A Concept-Based Explainability Framework for Large Multimodal Models

Mutual Information-based Representations Disentanglement for Unaligned Multimodal Language Sequences

Multimodal Sentiment Analysis Based on Disentangled Representation Learning and Cross-Modal-context Association Mining

Multimodal Disentangled Representation for Recommendation

Disentangling Multi-view Representations Beyond Inductive Bias

Improving Explainability of Disentangled Representations using Multipath-Attribution Mappings

Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data

Independence Constrained Disentangled Representation Learning from Epistemological Perspective

Disentangling Factors of Variation in Deep Representations Using Adversarial Training

Triple Disentangled Representation Learning for Multimodal Affective Analysis

Disentanglement Translation Network for multimodal sentiment analysis

An Information Criterion for Controlled Disentanglement of Multimodal Data

Disentangled Multimodal Representation Learning for Recommendation

Explainability Enhanced Object Detection Transformer with Feature Disentanglement

MInD: Improving Multimodal Sentiment Analysis via Multimodal Information Disentanglement

Explaining Multimodal Data Fusion: Occlusion Analysis for Wilderness Mapping

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

Interpretable Disentanglement of Neural Networks by Extracting Class-Specific Subnetwork