Soft Subspace Clustering of Categorical Data with Probabilistic Distance

Lifei Chen, Shengrui Wang, Kaijun Wang, Jianping Zhu
DOI: https://doi.org/10.1016/j.patcog.2015.09.027
IF: 8
2016-01-01
Pattern Recognition
Abstract: Categorical data clustering is an important subject in pattern recognition. Currently, subspace clustering of categorical data remains an open problem due to the difficulties in estimating attribute interestingness according to the statistics of categories in clusters. In this paper, a new algorithm is proposed for clustering categorical data with a novel soft feature-selection scheme, by which each categorical attribute is automatically assigned a weight that correlates with the smoothed dispersion of the categories in a cluster. In the proposed algorithm, dissimilarity between categorical data objects is measured using a probabilistic distance function, based on kernel density estimation for categorical attributes. We also make use of the probabilistic distances to define a cluster validity index for estimating the number of categorical clusters. The suitability of the proposal is demonstrated in an empirical study done with some widely used real-world data sets and synthetic data sets, and the results show its outstanding performance.