CLAMS: A System for Zero-Shot Model Selection for Clustering

Prabhant Singh, Pieter Gijsbers, Murat Onur Yildirim, Elif Ceren Gok, Joaquin Vanschoren
2024-07-16
Abstract: We propose an AutoML system that enables model selection on clustering problems by leveraging optimal transport-based dataset similarity. Our objective is to establish a comprehensive AutoML pipeline for clustering problems and provide recommendations for selecting the most suitable algorithms, thus opening up a new area of AutoML beyond the traditional supervised learning settings. We compare our results against multiple clustering baselines and find that our system outperforms all of them, hence demonstrating the utility of similarity-based automated model selection for solving clustering applications.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to automatically select the most appropriate clustering model when no labels are available. Specifically, the authors propose a new zero-shot model selection framework that recommends clustering models for new datasets by pre-training on internal and external clustering validity indices (CVIs) from previous tasks.

### Core Problems of the Paper

1. **Challenges in Clustering Problems**
   - Clustering is a form of unsupervised learning, and the lack of labeled data makes it difficult to apply traditional AutoML techniques.
   - Traditional clustering methods rely on internal or external CVIs (such as the Calinski-Harabasz index and the Silhouette score, both internal CVIs), but these metrics do not correlate strongly with one another, and it is very difficult to optimize an external CVI on a new dataset that has no labels.
2. **Limitations of Existing Methods**
   - Existing automated clustering methods usually only optimize the number of clusters or select algorithms through meta-learning, and most of them require labels or specific evaluation metrics.
   - Current AutoML systems mainly focus on supervised learning tasks, and there are relatively few automated solutions for clustering problems.

### Proposed Solutions

The authors propose CLAMS (Clustering with Automated Machine Learning System) and its extension CLAMS-OT (a zero-shot model recommendation system based on optimal transport) to solve the above problems:

- **CLAMS**: A standalone AutoML tool that automatically selects clustering algorithms and their hyperparameter configurations. It supports internal and external CVIs and allows selection of a complete pipeline.
- **CLAMS-OT**: Uses the optimal transport distance (OTD) to measure the similarity between datasets, so that it can recommend a suitable clustering model without the need for labels. After applying pre-processing and transformation functions, it compares the new dataset with known datasets and selects the best clustering algorithm recorded for the most similar one.

### Method Innovation Points

- **Application of Optimal Transport Distance**: Introducing the low-rank Gromov-Wasserstein distance (GW-LR) gives a good balance between computational efficiency and accuracy when measuring the similarity between datasets.
- **Zero-Shot Recommendation**: CLAMS-OT can quickly recommend a clustering model for a new dataset without any labels, which matters in practical settings where labels are unavailable.

### Experimental Verification

The authors evaluate CLAMS-OT experimentally, and the results show that it outperforms the clustering baselines it is compared against across multiple benchmarks. In particular, CLAMS-OT achieves higher scores on the Adjusted Mutual Information (AMI) metric.

### Summary

This paper provides a novel automated clustering solution that combines optimal transport theory with a zero-shot recommendation mechanism, filling a gap in existing AutoML techniques for unsupervised learning.
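The zero-shot recommendation step described under Proposed Solutions (find the most similar known dataset, then reuse its best configuration) can be sketched as follows. This is only an illustration under stated assumptions, not the CLAMS-OT implementation: the names `proxy_distance`, `recommend`, and `meta_store` are hypothetical, the stored configurations are invented for the example, and the distance is a crude stand-in that compares the distributions of pairwise distances with a one-dimensional Wasserstein distance rather than the low-rank Gromov-Wasserstein distance the paper uses.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Sorted Euclidean distances between all pairs of points."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def quantiles(sorted_values, k):
    """Pick k evenly spaced quantiles from an already sorted list."""
    n = len(sorted_values)
    return [sorted_values[min(n - 1, int((i + 0.5) * n / k))] for i in range(k)]


def proxy_distance(a, b, k=32):
    """Stand-in dataset distance (NOT GW-LR): approximate 1-D Wasserstein
    distance between the pairwise-distance distributions of a and b."""
    qa = quantiles(pairwise_distances(a), k)
    qb = quantiles(pairwise_distances(b), k)
    return sum(abs(x - y) for x, y in zip(qa, qb)) / k


def recommend(new_dataset, meta_store):
    """Return the name and stored best configuration of the most similar dataset."""
    name = min(
        meta_store,
        key=lambda n: proxy_distance(new_dataset, meta_store[n]["data"]),
    )
    return name, meta_store[name]["best_config"]


# Hypothetical meta-store: previously seen datasets and the configuration
# that scored best on each (values invented for illustration).
tight = [(x * 0.1, y * 0.1) for x in range(4) for y in range(4)]
ring = [
    (10 * math.cos(2 * math.pi * i / 16), 10 * math.sin(2 * math.pi * i / 16))
    for i in range(16)
]
meta_store = {
    "tight_blobs": {"data": tight, "best_config": {"algorithm": "KMeans", "n_clusters": 2}},
    "wide_ring": {"data": ring, "best_config": {"algorithm": "DBSCAN", "eps": 4.0}},
}

# A new, unlabeled dataset: no labels are needed to obtain a recommendation.
new_data = [(x * 0.12, y * 0.12) for x in range(4) for y in range(4)]
print(recommend(new_data, meta_store))
```

Because the recommendation reduces to a nearest-neighbor lookup over stored datasets, no clustering has to be run or scored on the new data at selection time; the quality of the recommendation then rests on how well the chosen distance reflects which configurations transfer.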