Estimation of number of clusters in categorical data via distance-based likelihood function

Peng Zhang, Yaolong Feng, Xiaogang Wang
DOI: https://doi.org/10.1109/ICNC.2011.6022590
2011-01-01
Abstract: We propose a new approach to selecting the number of clusters for categorical data via a likelihood function based on Hamming distances. We discuss properties of the distance random variable for categorical data and of the maximum likelihood estimators. An expected maximized log-likelihood function for data from a single cluster is computed using simulated data. Changes in the maximized log-likelihood function across different numbers of clusters are compared with thresholds obtained from the expected counterparts. The estimated number of clusters is chosen to be the first integer at which the former change is no longer significantly larger than the latter. Simulation studies are carried out to examine the accuracy of the proposed method. We also give an example of real data analysis in the paper.
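As a minimal illustration of the distance on which the likelihood is built, the sketch below computes the Hamming distance from each categorical observation to a cluster center. The function names, the toy data, and the use of per-attribute modes as the center are assumptions made for illustration; the abstract does not specify how the center is defined or how the likelihood is parameterized.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent category of each attribute (assumed cluster center)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]


# Hypothetical toy data: four observations, three categorical attributes.
data = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("a", "y", "p"),
    ("b", "x", "p"),
]

center = column_modes(data)                         # ['a', 'x', 'p']
distances = [hamming(row, center) for row in data]  # [0, 1, 1, 1]
```

The resulting distances are the kind of integer-valued observations to which a distance-based likelihood could be fitted for each candidate number of clusters.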