Information-based Projection Method for Categorical Clustering and Outlier Detection

Stephen S-T. Yau, Dongmin Cai
2006-01-01
Abstract: Clustering categorical datasets is a difficult task because there is no "natural" dissimilarity measure between categorical values. In this dissertation, we present a novel solution that uses information theory to provide a reasonable projection from categorical attributes to numerical values. It heuristically exploits the information that each categorical attribute brings and the probability that each attribute provides to convert those categorical attributes into numeric space. The method applies not only to categorical datasets but also to mixed-type datasets. In addition to clustering analysis, our approach is helpful for detecting outliers: the ability to convert categorical values to numerical data makes many existing outlier detection methods applicable. We demonstrate the effectiveness of our approach through a series of experiments on real-life datasets.
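To make the idea of an information-based projection concrete, here is a minimal sketch in Python. The abstract does not give the actual formula, so everything in this block is an assumption made for illustration: the function name project_column is hypothetical, and the mapping (each value's self-information, scaled by the Shannon entropy of its attribute) is one plausible way to combine "information per attribute" with "probability per value". It should not be read as the authors' method.

```python
import math
from collections import Counter


def project_column(values):
    """Map each categorical value in one attribute to a number.

    Illustrative assumption only, not the formula from the dissertation:
    score = H(attribute) * (-log2 p(value)).
    """
    n = len(values)
    counts = Counter(values)
    # Empirical probability of each distinct value in this attribute.
    probs = {v: c / n for v, c in counts.items()}
    # Shannon entropy of the attribute: how much information it carries.
    entropy = -sum(p * math.log2(p) for p in probs.values())
    # Self-information of each value, weighted by the attribute's entropy.
    return [entropy * -math.log2(probs[v]) for v in values]


# A toy attribute: "blue" is rare, so it lands farther from zero than "red".
colors = ["red", "red", "red", "blue"]
projected = project_column(colors)
```

Under this assumed mapping, rare values receive larger scores than common ones, which suggests how a numeric outlier detector could then be applied to the projected column. An attribute with a single repeated value has zero entropy and projects to zero everywhere, so it contributes nothing to distances.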