Exploring automated Feature Selection for Model-based and Density-based clustering with application to NCI 60 data

Suruchi Jai Kumar Ahuja
DOI: https://doi.org/10.1101/2024.04.21.589433
2024-04-26
Abstract: A major objective of clustering is to identify groups in the data that maximize the similarity between objects within the same cluster and minimize the similarity between objects in different clusters. A challenge for data clustering, and unsupervised learning in general, is that there is often no mechanism for feature selection. In contrast, supervised learning problems can be solved in connection with feature selection methods, such as subset selection or LASSO-like penalties. However, variable selection in unsupervised learning problems is not well defined since there is no response variable, which makes subset selection far more challenging. Consequently, there have been comparatively few methods that automate feature selection for clustering. Typically, when faced with high dimensionality, or the possibility of irrelevant features, an investigator will employ dimension reduction techniques with standard clustering algorithms. In this work, I examine two methods that encode feature selection into the clustering process, variable selection via model-based clustering and density-based clustering, using the Clustvarsel and DBSCAN packages in the R programming language. These methods were applied to the NCI-60 data and compared to principal component (PC) based k-means over different parameter settings. Results indicate major advantages in the performance of PC based k-means when compared to feature selection via Clustvarsel and DBSCAN, and major limitations in Clustvarsel.
Bioinformatics
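
For readers unfamiliar with the density-based clustering named in the abstract, the following is a minimal sketch of the general DBSCAN idea in plain Python. It is an illustration only: the paper used R packages, and this is not their implementation. The function names (`dbscan`, `region_query`), the parameter names (`eps`, `min_pts`), and the toy points are all assumptions made for this example, and the sketch covers the clustering step only, not feature selection.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels


# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

On this toy input the two dense groups receive separate cluster ids and the isolated point is marked as noise. Unlike k-means, the number of clusters is not fixed in advance; it follows from the choice of `eps` and `min_pts`, which is one reason results can vary substantially across parameter settings.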