Two-stage partial image-text clustering (TPIT-C)

Dongjin Guo, Xiaoming Su, Yahong Lian, Limin Liu, Haibo Wang
DOI: https://doi.org/10.1049/cvi2.12117
IF: 1.484
2022-01-01
IET Computer Vision
Abstract: Deep multi-modal clustering is a challenging task for data analysis since it must learn a universal semantic representation to find correct clusters from heterogeneous samples. However, most existing methods 1) lack an effective approach to obtaining a global representation of visual instances, which results in a large semantic gap between the visual and textual spaces, and 2) rarely consider the partial multi-modal setting, where each instance is represented by only one modality. In reality, the pairing information for modalities is not available for all instances. To tackle these issues, we propose a novel model called Two-Stage Partial Image-Text Clustering (TPIT-C). First, we build an interpretable reasoning network to obtain the salient regions and semantic concepts of the scene in order to generate global semantic concepts. Second, we construct an adversarial learning module to align textual and visual instances into a unified space by virtue of cycle-consistency. Experimental evaluations on public unpaired multi-modal datasets show that the proposed method achieves better performance, demonstrating the effectiveness of our algorithm in the partial image-text clustering task.