Low Dimensional Representation of Space Structure and Clustering of Categorical Data

Jianjun Cao, Qibin Zheng, Xingchun Diao, Nianfeng Weng
DOI: https://doi.org/10.1109/bdcloud.2018.00161
2018-01-01
Abstract: Dissimilarity measurement plays a key role in clustering analysis. Because categorical values have no order relation, clustering categorical data is harder than clustering numerical data. To improve the clustering quality of categorical data, the SBC (space structure based clustering) algorithm introduced a novel scheme for representing the space structure of such data. That scheme improved the discriminability of categorical data, but it also introduced two problems: low efficiency and high dimensionality. In this work, we show that categorical data can be represented with the space structure more efficiently while maintaining the same clustering performance. To achieve this, a fraction of representative objects is selected as the reference set, from which a low-dimensional space structure matrix is built. Since the reference set directly affects the dissimilarity measure, a cluster-based method is proposed to obtain a better reference set. Theoretical analysis and experiments show that, compared with the SBC method, the proposed methods are more efficient and extensible while maintaining approximately the same clustering performance.
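The idea of a reference-set representation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the dissimilarity measure or the selection procedure, so the sketch assumes a simple matching dissimilarity (fraction of mismatched attributes), a hand-picked reference set, and hypothetical function names. It only shows how describing each object by its dissimilarities to a few reference objects, instead of to all objects, shrinks the matrix from n x n to n x k.

```python
def matching_dissimilarity(a, b):
    """Fraction of attributes on which two categorical objects differ.

    Assumption: simple matching is used here only for illustration; the
    paper's actual dissimilarity measure is not given in the abstract.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def space_structure_matrix(data, reference):
    """Row i holds the dissimilarities from data[i] to each reference object."""
    return [
        [matching_dissimilarity(obj, ref) for ref in reference]
        for obj in data
    ]


# Toy categorical data set (hypothetical values).
data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
]

# Using every object as a reference gives an n x n matrix.
full = space_structure_matrix(data, data)

# Using a small reference set gives an n x k matrix with k < n.
# Assumption: the reference objects are picked by hand here; the paper
# proposes a cluster-based selection that is not reproduced in this sketch.
reference = [data[0], data[2]]
low_dim = space_structure_matrix(data, reference)

for row in low_dim:
    print(row)
```

The rows of `low_dim` are numerical vectors, so under these assumptions they could be passed to any standard numerical clustering routine. How well the clusters are preserved depends on how representative the reference set is, which is the motivation the abstract gives for a cluster-based selection method.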