Contrastive Hierarchical Clustering

Michał Znaleźniak, Przemysław Rola, Patryk Kaszuba, Jacek Tabor, Marek Śmieja
2023-06-22
Abstract: Deep clustering has been dominated by flat models, which split a dataset into a predefined number of groups. Although recent methods achieve an extremely high similarity with the ground truth on popular benchmarks, the information contained in the flat partition is limited. In this paper, we introduce CoHiClust, a Contrastive Hierarchical Clustering model based on deep neural networks, which can be applied to typical image data. By employing a self-supervised learning approach, CoHiClust distills the base network into a binary tree without access to any labeled data. The hierarchical clustering structure can be used to analyze the relationship between clusters, as well as to measure the similarity between data points. Experiments demonstrate that CoHiClust generates a reasonable structure of clusters, which is consistent with our intuition and image semantics. Moreover, it obtains superior clustering accuracy on most of the image datasets compared to the state-of-the-art flat clustering models.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the fact that most current deep clustering methods are flat models, which divide a dataset into a predefined number of groups. Although recent methods agree very closely with the ground-truth labels on popular benchmarks, a flat partition carries limited information. Hierarchical clustering is also relatively rare in existing deep learning frameworks, especially for color image datasets. To fill this gap, the authors propose **CoHiClust** (Contrastive Hierarchical Clustering), a contrastive hierarchical clustering model based on deep neural networks. The model aims to produce a reasonable cluster hierarchy through self-supervised learning, without access to any labeled data. Specifically, CoHiClust distills the base neural network into a binary tree, which can be used to analyze the relationships between clusters and to measure the similarity between data points.

### Main Problem Summary

1. **Information Limitations**: Flat clustering models provide only a single partition of the data and cannot capture how the resulting groups relate to one another.
2. **Insufficient Application of Hierarchical Clustering in Deep Learning**: Although hierarchical clustering is widely used in classical machine learning, it is rarely applied in deep learning, especially on color image datasets.
3. **Requirement for Unsupervised Learning**: Many existing methods rely on labeled data for training, whereas CoHiClust aims to achieve unsupervised hierarchical clustering through self-supervised learning.

### Solutions

- **Model Design**: CoHiClust uses a deep neural network to generate high-dimensional representations and maps them to a binary tree structure through a projection head.
- **Loss Function**: A contrastive hierarchical loss function is introduced so that similar data points follow the same path in the tree.
- **Regularization Strategy**: Two regularization terms (\(R_1\) and \(R_2\)) are added to the objective to keep the tree balanced and the representation effective.

### Experimental Verification

The experimental results show that CoHiClust outperforms existing flat clustering models on multiple image datasets and produces hierarchical structures that agree with intuition and image semantics. For example, CoHiClust performs particularly well on the CIFAR-10 and ImageNet-10 datasets.

### Key Formulas

- Contrastive hierarchical loss function:

\[ \text{CoHiLoss} = \frac{1}{N(N-1)} \sum_{j=1}^{N} \sum_{i \neq j} s(x_j, \tilde{x}_i) - \frac{1}{N} \sum_{j=1}^{N} s(x_j, \tilde{x}_j) \]

where \(s(x_1, x_2)\) is the similarity score between data points \(x_1\) and \(x_2\), and \(\tilde{x}_i\) denotes the augmented view of \(x_i\). Minimizing this loss lowers the average similarity of mismatched pairs (first term) and raises the average similarity of each point to its own augmented view (second term).

- Total loss with regularization:

\[ \text{Loss} = \text{CoHiLoss} + \beta_1 R_1 + \beta_2 R_2 \]

where \(\beta_1\) and \(\beta_2\) are hyperparameters that control the weight of the regularization terms \(R_1\) and \(R_2\), respectively.

With these components, CoHiClust both improves clustering performance and offers a new direction for applying hierarchical clustering in deep learning.
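
The two formulas above can be sketched numerically as follows. This is a minimal NumPy illustration, not the authors' implementation. `cohi_loss` and `total_loss` follow the formulas as written in this summary. The helpers `node_probabilities` and `similarity` are hypothetical: the summary does not define \(s\) or how the projection head is mapped to the tree, so the sketch assumes that each internal node outputs a "go left" probability and that \(s\) sums, over tree levels, the overlap \(\sum_i \sqrt{p_i q_i}\) of the two points' node-reach probabilities.

```python
import numpy as np


def node_probabilities(left_probs, depth):
    """Assumed tree mapping (hypothetical helper, not from the summary).

    left_probs: array of length 2**depth - 1 with the probability of
    branching left at each internal node, stored level by level (root first).
    Returns a list where entry t holds the 2**t probabilities of reaching
    each node on level t; every level sums to 1.
    """
    levels = [np.array([1.0])]  # the root is always reached
    start = 0
    for t in range(depth):
        parent = levels[-1]
        p_left = left_probs[start:start + 2 ** t]
        start += 2 ** t
        child = np.empty(2 ** (t + 1))
        child[0::2] = parent * p_left          # left children
        child[1::2] = parent * (1.0 - p_left)  # right children
        levels.append(child)
    return levels


def similarity(levels_a, levels_b):
    """Assumed similarity s: per-level overlap, summed over levels 1..depth."""
    return sum(float(np.sum(np.sqrt(a * b)))
               for a, b in zip(levels_a[1:], levels_b[1:]))


def cohi_loss(sim):
    """CoHiLoss from an N x N matrix with sim[j, i] = s(x_j, x~_i)."""
    n = sim.shape[0]
    positives = np.trace(sim) / n                           # matched pairs
    negatives = (sim.sum() - np.trace(sim)) / (n * (n - 1))  # mismatched pairs
    return float(negatives - positives)


def total_loss(cohi, r1, r2, beta1, beta2):
    """Loss = CoHiLoss + beta1 * R1 + beta2 * R2."""
    return cohi + beta1 * r1 + beta2 * r2


# Tiny example: 3 points and their augmented views, random depth-2 trees.
rng = np.random.default_rng(0)
depth = 2
views = [node_probabilities(rng.uniform(size=2 ** depth - 1), depth)
         for _ in range(3)]
augmented = [node_probabilities(rng.uniform(size=2 ** depth - 1), depth)
             for _ in range(3)]
sim = np.array([[similarity(a, b) for b in augmented] for a in views])
loss = total_loss(cohi_loss(sim), r1=0.0, r2=0.0, beta1=1.0, beta2=1.0)
```

Under this assumed similarity, two points that take identical paths score `depth`, the maximum, so the positive term rewards augmented views that are routed through the same nodes at every level of the tree.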