Investigation on Several Model Selection Criteria for Determining the Number of Cluster

Xuelei Hu, Lei Xu
2004-01-01
Abstract: Determining the number of clusters is a crucial problem in clustering. Conventionally, the number of clusters has been selected with cost-function-based criteria such as Akaike's information criterion (AIC), the consistent Akaike's information criterion (CAIC), and the minimum description length (MDL) criterion, which formally coincides with the Bayesian information criterion (BIC). In this paper we study Bayesian Ying-Yang (BYY) harmony learning for model selection by comparing the BYY harmony data smoothing criterion (BYY-HDS) with several typical model selection criteria, including AIC, CAIC, and MDL. We empirically investigate model selection for clustering with all of these methods on simulated data sets of different sample sizes and on real data sets, including the well-known iris data set and a gene expression data set. The experimental results show that BYY-HDS outperforms the other methods, especially for small sample sizes. CAIC and MDL tend to underestimate the number of clusters, while AIC tends to overestimate it, especially when the sample size is small.
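The abstract names AIC, CAIC, and MDL/BIC without stating their formulas. As a minimal sketch, the block below uses the standard textbook forms of these criteria (an assumption, since the paper's exact scaling is not given here) and, as a further assumption, counts parameters for a Gaussian mixture with full covariance matrices. The function names are illustrative, not from the paper, and BYY-HDS is not sketched because the abstract does not define it.

```python
import math

def gmm_free_params(k, d):
    """Free parameters of a k-component, d-dimensional Gaussian mixture
    with full covariances (assumed model; not specified in the abstract)."""
    weights = k - 1                      # mixing weights sum to one
    means = k * d
    covariances = k * d * (d + 1) // 2   # symmetric d x d matrix per component
    return weights + means + covariances

def aic(log_lik, n_params):
    """Akaike's information criterion: -2 ln L + 2 m."""
    return -2.0 * log_lik + 2.0 * n_params

def caic(log_lik, n_params, n_samples):
    """Consistent AIC: -2 ln L + m (ln N + 1)."""
    return -2.0 * log_lik + n_params * (math.log(n_samples) + 1.0)

def mdl_bic(log_lik, n_params, n_samples):
    """MDL / BIC: -2 ln L + m ln N."""
    return -2.0 * log_lik + n_params * math.log(n_samples)

def select_k(scores):
    """Return the candidate k with the smallest criterion value.
    `scores` maps each candidate number of clusters to its score."""
    return min(scores, key=scores.get)
```

In this form, the per-parameter penalty is 2 for AIC, ln N for MDL/BIC, and ln N + 1 for CAIC, so CAIC and MDL penalize extra clusters more heavily than AIC once N exceeds about 8. That is consistent with, though not a proof of, the tendencies reported in the abstract.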