From Categorical to Numerical: Multiple Transitive Distance Learning and Embedding.

Kai Zhang, Qiaojun Wang, Zhengzhang Chen, Ivan Marsic, Vipin Kumar, Guofei Jiang, Jie Zhang
DOI: https://doi.org/10.1137/1.9781611974010.6
2015-01-01
Abstract: Categorical data are ubiquitous in real-world databases. However, because categorical data lack an intrinsic proximity measure, many powerful algorithms for numerical data analysis may not work well on them, which creates a bottleneck in practical applications. In this paper, we propose a novel method to transform categorical data into numerical representations, so that the abundant numerical learning methods can be exploited in categorical data mining. Our key idea is to learn a pairwise dissimilarity among categorical symbols, and hence a continuous embedding, which can then be used for subsequent numerical treatment. There are two important criteria for learning the dissimilarities. First, they should capture the important “transitivity”, which has been shown to be particularly useful in measuring the proximity relation in categorical data. Second, the pairwise sample geometry arising from the learned symbol distances should be maximally consistent with prior knowledge (e.g., class labels) to obtain good generalization performance. We achieve both through multiple transitive distance learning and embedding. Encouraging results are observed on a number of benchmark classification tasks against state-of-the-art methods.
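The step from symbol dissimilarities to a continuous embedding can be illustrated with a minimal sketch. It is not the authors' method: the dissimilarity matrix `D` and the symbol names are hypothetical, hand-picked values standing in for learned distances, and classical multidimensional scaling (MDS) is used here only as one standard way to turn pairwise dissimilarities into coordinates.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed n items in `dim` dimensions from an n x n dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]       # keep the top `dim` eigenpairs
    vals = np.clip(vals[idx], 0.0, None)     # guard against negative eigenvalues
    return vecs[:, idx] * np.sqrt(vals)

# Hypothetical dissimilarities among three symbols of one categorical attribute
# (assumed values for illustration, not taken from the paper).
symbols = ["red", "orange", "blue"]
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])

coords = classical_mds(D, dim=1)             # one numerical coordinate per symbol
embedding = dict(zip(symbols, coords))
```

Each categorical value is then replaced by its coordinate vector in `embedding`, after which any numerical learner can be applied to the transformed data.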