A Principled Framework for Explainable Multimodal Disentanglement

Zongbo Han, Tao Luo, Huazhu Fu, Qinghua Hu, Joey Tianyi Zhou, Changqing Zhang
DOI: https://doi.org/10.1016/j.ins.2024.120768
IF: 8.1
2024-01-01
Information Sciences
Abstract: Learning effective representations for data from multiple modalities is crucial in machine learning. Recent efforts focus on learning latent representations that integrate information from various modalities. These approaches generally assume simple or implicit relationships between different modalities and as a result are not able to accurately and explicitly depict the correlations among these modalities and lack explainability. To address this, we propose definitions and conditions for unsupervised multimodal disentanglement, offering guidelines for explicit disentanglement between modalities to enhance explainability. Furthermore, we have derived a novel objective function to explicitly separate multimodal data into components shared across modalities and components exclusive to each modality. The explicit guaranteed disentanglement is of great potential for downstream tasks. Benefiting from a cleverly designed network structure, we can visualize these disentangled representations, providing intuitive explainability. Experiments on a variety of multimodal datasets demonstrate that our objective can effectively disentangle information from different modalities while satisfying the disentangling conditions.