Abstract: Despite significant advances in deep clustering research, most existing approaches share three critical limitations. First, they often derive the clustering result by attaching a distribution-based loss to specific network layers, neglecting the potential benefits of contrastive sample-wise relationships. Second, they frequently learn representations only at the full-image scale, overlooking the discriminative information latent in partial image regions. Third, although some prior studies learn at multiple levels, they mostly cannot exploit the interaction between those levels. To overcome these limitations, this paper presents a novel deep image clustering approach via Partial Information discrimination and Cross-level Interaction (PICI). Specifically, we use a Transformer encoder as the backbone, coupled with two types of augmentations that form two parallel views. The augmented samples, with a portion of their patches masked, are processed by the Transformer encoder to produce the class tokens. Three partial information learning modules are then jointly enforced: the partial information self-discrimination (PISD) module for masked image reconstruction, the partial information contrastive discrimination (PICD) module for simultaneous instance- and cluster-level contrastive learning, and the cross-level interaction (CLI) module for ensuring consistency across the learning levels. Through this unified formulation, our PICI approach is, to our knowledge, the first to bridge the gap between masked image modeling and deep contrastive clustering, offering a novel pathway for enhanced representation learning and clustering. Experimental results on six image datasets demonstrate the superiority of our PICI approach over the state-of-the-art. In particular, our approach achieves an ACC of 0.772 (0.634) on the RSOD (UC-Merced) dataset, an improvement of 29.7% (24.8%) over the best baseline. The source code is available at https://github.com/Regan-Zhang/PICI.
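
To make the instance- and cluster-level contrastive learning mentioned above more concrete, the following is a minimal NumPy sketch of a generic NT-Xent-style loss applied at both levels. It is not the authors' implementation (the linked repository is the reference); the function names, the temperature values, and the choice of NT-Xent itself are assumptions made for illustration. At the instance level, rows of the two views' embeddings are paired; at the cluster level, the columns of the two views' soft assignment matrices are paired instead.

```python
import numpy as np

def softmax(x, axis=1):
    """Numerically stable softmax (hypothetical helper)."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nt_xent(a, b, temperature=0.5):
    """Generic NT-Xent loss: row i of `a` and row i of `b` form a positive pair.

    All other rows of both views act as negatives. This is an assumed,
    commonly used formulation, not necessarily the loss used in PICI.
    """
    n = a.shape[0]
    z = np.concatenate([a, b], axis=0)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)  # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # index of each positive
    return -log_prob[np.arange(2 * n), pos].mean()

def instance_level_loss(h_a, h_b, temperature=0.5):
    """Contrast samples: inputs are (n_samples, dim) embeddings of the two views."""
    return nt_xent(h_a, h_b, temperature)

def cluster_level_loss(p_a, p_b, temperature=1.0):
    """Contrast clusters: inputs are (n_samples, n_clusters) soft assignments.

    Transposing makes each cluster's assignment column the contrasted unit.
    """
    return nt_xent(p_a.T, p_b.T, temperature)
```

A usage example under the same assumptions: with `h_a` and `h_b` as the class-token embeddings of the two augmented views, and `p_a = softmax(logits_a)` and `p_b = softmax(logits_b)` as the corresponding soft cluster assignments, a combined objective could be formed as `instance_level_loss(h_a, h_b) + cluster_level_loss(p_a, p_b)`. How PICI weights these terms, and how the CLI module couples them, is not specified in the abstract.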

Two-stage partial image-text clustering (TPIT-C)

Mejigclu: more effective jigsaw clustering for unsupervised visual representation learning

ECCT: Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation

Dual-Level Cross-Modal Contrastive Clustering

Learning clustering-friendly representations via partial information discrimination and cross-level interaction

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

Web Image Clustering by Consistent Utilization of Visual Features and Surrounding Texts

Text-Guided Alternative Image Clustering

Learning Representations for Clustering via Partial Information Discrimination and Cross-Level Interaction

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Re-Identification

Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Turning a CLIP modal into image-text matching

Tensorized Bipartite Graph Learning for Multi-View Clustering

Clustering swap prediction for image-text pre-training

Image Clustering Conditioned on Text Criteria

TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

Image paragraph captioning with topic clustering and topic shift prediction

Contrastive completing learning for practical text–image person ReID: Robuster and cheaper

Multi-level Cross-modal Alignment for Image Clustering

Image Clustering with External Guidance