A Conditional-Probability Zone Transformation Coding Method for Categorical Features.

He Liang, Xu Zhengguo, Li Yun, Shen Chao
DOI: https://doi.org/10.1145/3321408.3326636
2019-01-01
Abstract: Encoding categorical features so that machine learning models can use them efficiently is a key issue. One-hot coding, the state-of-the-art approach, is widely accepted for converting categorical features into numerical values. However, it produces a sparse space, and the resulting codes carry no meaning. We propose a novel coding method based on conditional probability after dividing the features into zones, called Conditional-probability-based Zone Transformation (CZT) coding. CZT coding calculates the conditional probability of each feature, then divides the features into several zones according to that probability, and finally codes the features in each zone. We prove mathematically that, compared with the state-of-the-art method, CZT coding reduces the code length by at least the mean of the feature space, and that the problem becomes easier for the downstream machine learning model after CZT coding. Finally, using the same neural network as the classifier, we compare the performance of CZT coding and one-hot coding on the Titanic dataset, where most of the features are categorical. The result is that CZT coding makes the classifier perform better in both accuracy and stability.
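The three steps named in the abstract (estimate a conditional probability, split into zones, code by zone) can be sketched as follows. This is only an illustrative reading, not the authors' implementation: the abstract does not specify the conditioning event, the zone boundaries, or the per-zone code. The sketch assumes the probability is P(label = positive | category value), assumes equal-width zones over [0, 1], and assumes the code is simply the zone index. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def conditional_probabilities(values, labels, positive=1):
    """Estimate P(label == positive | category value) from paired samples."""
    counts = defaultdict(int)
    hits = defaultdict(int)
    for value, label in zip(values, labels):
        counts[value] += 1
        if label == positive:
            hits[value] += 1
    return {value: hits[value] / counts[value] for value in counts}


def assign_zones(probs, n_zones):
    """Map each category value to a zone index in [0, n_zones - 1].

    Assumption: zones are equal-width intervals over [0, 1]; the paper
    may define zone boundaries differently.
    """
    return {v: min(int(p * n_zones), n_zones - 1) for v, p in probs.items()}


def zone_encode(values, zones):
    """Replace each category value with its zone index (assumed code)."""
    return [zones[v] for v in values]


# Made-up toy data, loosely shaped like a port-of-embarkation feature.
embarked = ["S", "C", "S", "Q", "C", "S", "Q", "C"]
survived = [0, 1, 0, 0, 1, 1, 1, 1]

probs = conditional_probabilities(embarked, survived)
zones = assign_zones(probs, n_zones=2)
codes = zone_encode(embarked, zones)
```

In this sketch the feature has three distinct values, so one-hot coding would need three columns, while the zone code occupies a single integer column. How closely this matches the code-length reduction proved in the paper depends on details the abstract does not give.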