ConGMC: Consistency-Guided Multimodal Clustering via Mutual Information Maximin

Jiaming Liu, Yiqiao Mao, Xiaoqiang Yan, Yangdong Ye
DOI: https://doi.org/10.1109/TMM.2023.3330093
IF: 7.3
IEEE Transactions on Multimedia
Abstract: Aligning multiple heterogeneous modalities in a parameter-sharing encoder to mine consistent information is a core idea of multimodal learning. However, two drawbacks hinder the development of such methods for clustering tasks: 1) each modality contains a considerable amount of superfluous information that cannot be aligned, which impedes the mining of consistent information; and 2) one-to-one alignment contradicts the clustering principle of minimum intra-cluster distance, leading to suboptimal clustering results. In this paper, we propose a novel Consistency-Guided Multimodal Clustering method (ConGMC) that uses information theory to remove superfluous information within the modalities in an unsupervised manner, while adapting one-to-one alignment to the clustering task. ConGMC contains multiple unimodal encoders and a multimodal shared encoder: the former learn unimodal representations, while the latter aligns the modalities to learn the cluster partition. Specifically, we first construct a mutual information maximin function to distinguish consistent information from superfluous information, in which consistent information is maximally retained and superfluous information is maximally removed. Then a Clustering-Friendly Alignment strategy (CF-Align) is designed to address the contradiction between the alignment and clustering tasks. CF-Align dynamically adjusts the set of negative samples according to the learned cluster partition to avoid increasing the intra-cluster distance. Finally, we treat the cluster partition as a consistency constraint for optimizing the multimodal shared encoder, enabling consistent information to guide the training process iteratively. Moreover, a variational optimization algorithm is proposed to ensure that ConGMC converges to a local optimum. Extensive experiments on twelve real-world datasets show that the proposed ConGMC method outperforms state-of-the-art multimodal clustering methods.
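The idea behind CF-Align, adjusting the negative set according to the current cluster partition, can be illustrated with a small sketch. This is not the authors' implementation: it assumes an InfoNCE-style contrastive loss between two modalities, cosine similarity, and a fixed temperature, none of which are specified in the abstract. The function name `cluster_aware_infonce` and the parameter `tau` are hypothetical.

```python
import numpy as np

def cluster_aware_infonce(z1, z2, labels, tau=0.5):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's loss).

    z1, z2 : (n, d) embeddings of the same n samples from two modalities.
    labels : (n,) current cluster assignments.
    Samples in the same cluster as the anchor are dropped from its
    negative set, so the loss does not push them apart.
    """
    labels = np.asarray(labels)
    # Cosine similarity between every cross-modal pair, scaled by tau.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau

    # Keep only different-cluster pairs as negatives, plus the positive
    # pair (i, i) on the diagonal.
    keep = labels[:, None] != labels[None, :]
    np.fill_diagonal(keep, True)

    # Numerically stable log-sum-exp over the kept entries of each row.
    masked = np.where(keep, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(masked - row_max).sum(axis=1))

    # Mean of -log( exp(s_ii) / sum over kept j of exp(s_ij) ).
    return float(np.mean(log_denom - np.diag(sim)))

# Usage with random embeddings and a toy partition into three clusters.
rng = np.random.default_rng(0)
z_a = rng.normal(size=(6, 4))
z_b = rng.normal(size=(6, 4))
loss = cluster_aware_infonce(z_a, z_b, np.array([0, 0, 1, 1, 2, 2]))
```

When every sample has its own cluster, the sketch reduces to a standard one-to-one contrastive loss; as clusters grow, fewer same-cluster samples are treated as negatives, which is the behavior the abstract attributes to CF-Align.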
Computer Science