ESCHCD: entropy-based algorithm for subspace clustering with high dimensional categorical datasets

SUN Hao-jun, DU Yu-lin, JIANG Da-zhi
2011-01-01
Abstract: In high-dimensional categorical datasets, there is no exact measure of similarity between data objects, and the data distribution is usually sparse. As a result, most traditional clustering algorithms that work well on low-dimensional data are ineffective on high-dimensional categorical datasets. To address these problems, a new clustering algorithm for high-dimensional categorical data was proposed, called ESCHCD (entropy-based algorithm for subspace clustering with high dimensional categorical datasets). An effective unsupervised objective function was designed to determine the subspace associated with each cluster by considering the entropies of the matched subspace and the noise subspace. In addition, a global optimization method based on average entropy was proposed to find the best clustering result. Comparison with other categorical clustering algorithms on synthetic datasets and on real datasets such as Votes, Mushroom and Soybean demonstrated the advantages of the new algorithm in efficiency, entropy measure, category utility (CU) and number of clusters.
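The abstract does not give the objective function itself, so the following is only a minimal sketch of the underlying idea: within a cluster, attributes with low Shannon entropy are candidates for the matched subspace and the remaining attributes for the noise subspace. The function names, the toy cluster and the fixed `threshold` are assumptions made for illustration; they are not taken from the paper, which determines the subspace through its objective function rather than a fixed cutoff.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def split_subspace(cluster, threshold=1.0):
    """Hypothetical rule, not the paper's objective function:
    attributes whose entropy is <= threshold are treated as 'matched',
    all other attributes as 'noise'."""
    columns = list(zip(*cluster))  # one tuple of values per attribute
    entropies = [attribute_entropy(col) for col in columns]
    matched = [j for j, h in enumerate(entropies) if h <= threshold]
    noise = [j for j, h in enumerate(entropies) if h > threshold]
    return entropies, matched, noise


# Toy cluster: 4 objects described by 3 categorical attributes (made-up data).
cluster = [
    ("y", "n", "a"),
    ("y", "n", "b"),
    ("y", "y", "c"),
    ("y", "n", "d"),
]

entropies, matched, noise = split_subspace(cluster)
print("entropy per attribute:", [round(h, 3) for h in entropies])
print("matched attributes:", matched)
print("noise attributes:", noise)
```

In this toy cluster the first attribute is constant and the second nearly so, so both fall on the matched side, while the third attribute takes a different value for every object and falls on the noise side.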