Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks

Wenbin Zhu, Runwen Qiu, Ying Fu
2024-01-18
Abstract: Categorical variables often appear in datasets for classification and regression tasks, and they need to be encoded into numerical values before training. Since many encoders have been developed and can significantly impact performance, choosing the appropriate encoder for a task becomes a time-consuming yet important practical issue. This study broadly classifies machine learning models into three categories: 1) ATI models that implicitly perform affine transformations on inputs, such as multi-layer perceptron neural network; 2) Tree-based models that are based on decision trees, such as random forest; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoders by learning suitable weights from the data. We also explain why the target encoder and its variants are the most suitable encoders for tree-based models. This study conducted comprehensive computational experiments to evaluate 14 encoders, including one-hot and target encoders, along with eight common machine-learning models on 28 datasets. The computational results agree with our theoretical analysis. The findings in this study shed light on how to select the suitable encoder for data scientists in fields such as fraud detection, disease diagnosis, etc.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to select an appropriate categorical variable encoder to improve the performance of machine-learning models in classification and regression tasks. Specifically, through theoretical analysis and extensive experiments, it explores how well different types of encoders suit different types of models, especially for datasets with high-cardinality categorical variables. The aim is to give data scientists a reliable guideline for selecting suitable encoders in practical problems such as fraud detection and disease diagnosis.

### Main contributions of the paper:

1. **Theoretical analysis**:
   - **One-Hot Encoder**: The paper proves that the One-Hot Encoder is the optimal choice for models that perform affine transformations on their inputs (such as multi-layer perceptron neural networks, linear regression, and logistic regression), because, given sufficient data, it can learn weights that mimic any other encoder.
   - **Target Encoder**: The paper explains why the Target Encoder and its variants are particularly suitable for tree-based models (such as random forests and gradient-boosted decision trees).
2. **Experimental verification**:
   - The paper reports comprehensive computational experiments that evaluate 14 encoders with 8 common machine-learning models on 28 datasets.
   - The experimental results are consistent with the theoretical analysis, supporting the superiority of the One-Hot Encoder for ATI models and the effectiveness of the Target Encoder for tree-based models.

### Research background:

- **Importance of categorical variables**: Categorical variables such as gender, education level, and city are very common in practical problems. They need to be encoded into numerical form before most machine-learning models can process them.
- **Challenge of encoder selection**: Encoders differ substantially in their effect on model performance, so selecting an appropriate one is an important practical problem.

### Theoretical analysis:

1. **One-Hot Encoder as a universal encoder**:
   - For models that perform affine transformations on their inputs (ATI models), the paper proves that the One-Hot Encoder can mimic any other encoder by learning appropriate weights.
   - Formula representation: Let \( \phi \) be an arbitrary encoder, \( \phi_{\text{OH}} \) be the One-Hot Encoder, and \( W_\phi \) and \( W_{\text{OH}} \) be the corresponding weight matrices. Then, for any \( W_\phi \), there exists \( W_{\text{OH}} \) such that:
     \[ W_{\text{OH}} \phi_{\text{OH}}(x_1) = W_\phi \phi(x_1) \quad \forall x_1 \]
2. **Advantages of the Target Encoder in tree-based models**:
   - The Target Encoder encodes a categorical variable by estimating the conditional mean of the target variable at each category level.
   - For tree-based models, the Target Encoder can retain the optimal split points, thereby improving model performance.
   - Formula representation: Let \( \phi_M \) be the Target Encoder, and let \( \phi_M(v_i) \) denote the average value of the target variable \( y \) at the category level \( v_i \). Then:
     \[ \phi_M(v_i) = \frac{1}{|D_{v_i}|} \sum_{(x, y) \in D_{v_i}} y \]
     where \( D_{v_i} \) is the set of training samples with the category level \( v_i \).

### Experimental results:

- **Impact of data sufficiency**: Through experiments on synthetic and natural datasets, the paper shows how data sufficiency (ASPL) affects encoder performance. As ASPL increases, the performance of the One-Hot Encoder and the Target Encoder gradually approaches that of the optimal encoder.
- **Performance of different models**: The experimental results show that the One-Hot Encoder performs best in ATI models, while the Target Encoder performs best in tree-based models.

### Conclusion:

The paper through theoretical analysis
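
### Illustrative sketch:

The two formulas in the theoretical analysis can be checked on a toy example. The sketch below is not the paper's code. The data, the weights `W_phi`, and the helper names `matmul`, `one_hot`, and `target_encode` are all hypothetical, chosen only for illustration. It computes \( \phi_M(v_i) \) as the per-level mean of \( y \). It then builds \( W_{\text{OH}} = W_\phi E \), where column \( i \) of \( E \) is \( \phi(v_i) \), and confirms that \( W_{\text{OH}} \phi_{\text{OH}}(x) = W_\phi \phi(x) \) for every sample.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def one_hot(levels, values):
    """Return a (num_levels x num_samples) matrix; column j is the one-hot vector of values[j]."""
    return [[1.0 if v == level else 0.0 for v in values] for level in levels]

def target_encode(values, y):
    """Map each level v_i to the mean of y over the samples having that level."""
    sums, counts = {}, {}
    for v, t in zip(values, y):
        sums[v] = sums.get(v, 0.0) + t
        counts[v] = counts.get(v, 0) + 1
    return {v: sums[v] / counts[v] for v in sums}

# Hypothetical toy training data: one categorical feature and a numeric target.
levels = ["a", "b", "c"]
x = ["a", "b", "a", "c", "b", "a"]
y = [1.0, 0.0, 3.0, 5.0, 2.0, 2.0]

# Target encoder: phi_M(v_i) is the mean of y within D_{v_i}.
phi_m = target_encode(x, y)

# Use the target encoder as the "arbitrary" encoder phi (1-dimensional output).
E = [[phi_m[v] for v in levels]]        # 1 x 3, column i holds phi(v_i)
W_phi = [[0.5], [-2.0]]                 # 2 x 1, hypothetical weights on phi's output
W_oh = matmul(W_phi, E)                 # 2 x 3, one-hot weights that mimic phi

lhs = matmul(W_oh, one_hot(levels, x))            # W_OH * phi_OH(x) for all samples
rhs = matmul(W_phi, [[phi_m[v] for v in x]])      # W_phi * phi(x) for all samples

same = all(abs(l - r) < 1e-12
           for row_l, row_r in zip(lhs, rhs)
           for l, r in zip(row_l, row_r))
print(phi_m, same)
```

The construction only shows that suitable one-hot weights exist. Whether a model actually learns them depends on having enough samples per level, which is the data-sufficiency point made in the experimental results.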