Abstract: Contrastive learning-based deep multi-view clustering methods have become a mainstream solution for unlabeled multi-view data. These methods usually rely on a basic structure that combines an autoencoder, contrastive learning, and/or MLP projectors to generate more representative latent representations for the final clustering stage. However, existing deep contrastive multi-view clustering methods ignore two key points: (i) latent representations projected by one or more MLP layers, or new representations obtained directly from the autoencoder, fail to mine the inherent relationships within a view or across views; (ii) most existing frameworks employ only one or two contrastive learning modules, i.e., view- and/or category-oriented, which may result in a lack of communication between latent representations and clustering assignments. This paper proposes a new composite attention framework for contrastive multi-view clustering to address these two challenges. Our method learns latent representations with a composite attention structure, i.e., a Hierarchical Transformer for each view and Shared Attention for all views, rather than a simple MLP. As a result, the learned representations simultaneously preserve important features inside each view and balance the contributions across views. In addition, we add a new communication loss to our dual contrastive framework. It brings the common semantics into the clustering assignments by pushing the clustering assignments closer to the fused latent representations. Our method therefore provides higher-quality clustering assignments for the segmentation problem of unlabeled multi-view data. Extensive experiments on several real-world datasets demonstrate that the proposed method achieves superior performance over many state-of-the-art clustering algorithms, most notably an average accuracy improvement of 10% on the Caltech dataset and its subsets.
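
To make the view-oriented contrastive idea concrete, here is a minimal sketch of a generic cross-view InfoNCE loss. It is not the paper's loss or architecture: the abstract does not give the exact form of its contrastive or communication losses, so the function names (`cosine`, `info_nce`), the temperature value, and the toy vectors below are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; (view_a[i], view_b[i]) is treated as the positive pair.

    Generic formulation, assumed here for illustration only.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the other view.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy latent representations of three samples seen from two views (made up).
z_view1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z_view2 = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

aligned = info_nce(z_view1, z_view2)         # matching samples are paired
shuffled = info_nce(z_view1, z_view2[::-1])  # pairing deliberately broken
print(aligned < shuffled)
```

In this toy setup the loss is lower when each sample's two views are paired correctly than when the pairing is broken, which is the behavior a view-oriented contrastive module relies on to align representations of the same sample across views.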

Deep Fusion: Capturing Dependencies in Contrastive Learning via Transformer Projection Heads

UniGrad-FS: Unified Gradient Projection with Flatter Sharpness for Continual Learning

Adaptive Multi-head Contrastive Learning

Full-Attention Driven Graph Contrastive Learning: with Effective Mutual Information Insight

Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances

Supplementary Material: Model-Contrastive Federated Learning

Unraveling Projection Heads in Contrastive Learning: Insights from Expansion and Shrinkage

Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning

Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Long-Short Temporal Contrastive Learning of Video Transformers

Composite attention mechanism network for deep contrastive multi-view clustering

Multi-view Feature Extraction based on Dual Contrastive Head

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

Curriculumformer: Taming Curriculum Pre-Training for Enhanced 3-D Point Cloud Understanding

Online Continual Learning with Contrastive Vision Transformer

Adaptive Split-Fusion Transformer

CLFT: Camera-LiDAR Fusion Transformer for Semantic Segmentation in Autonomous Driving

Point Cloud Understanding via Attention-Driven Contrastive Learning

Improving Contrastive Learning by Visualizing Feature Transformation

Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework