Abstract: Categorical variables often appear in datasets for classification and regression tasks, and they need to be encoded into numerical values before training. Since many encoders have been developed and the choice can significantly impact performance, choosing the appropriate encoder for a task becomes a time-consuming yet important practical issue. This study broadly classifies machine learning models into three categories: 1) ATI models that implicitly perform affine transformations on inputs, such as a multi-layer perceptron neural network; 2) tree-based models that are based on decision trees, such as random forest; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoder by learning suitable weights from the data. We also explain why the target encoder and its variants are the most suitable encoders for tree-based models. This study conducted comprehensive computational experiments to evaluate 14 encoders, including one-hot and target encoders, along with eight common machine-learning models on 28 datasets. The computational results agree with our theoretical analysis. The findings in this study shed light on how data scientists in fields such as fraud detection and disease diagnosis can select a suitable encoder.

What problem does this paper attempt to address?

The paper addresses how to select an appropriate categorical variable encoder to improve the performance of machine-learning models in classification and regression tasks. Through theoretical analysis and extensive experiments, it examines how well different types of encoders suit different types of models, especially for datasets with high-cardinality categorical variables. The aim is to give data scientists a reliable guiding principle for selecting encoders in practical problems such as fraud detection and disease diagnosis.

### Main contributions of the paper:

1. **Theoretical analysis**:
   - **One-Hot Encoder**: The paper proves that the One-Hot Encoder is the optimal choice for models performing affine transformations (such as multi-layer perceptron neural networks, linear regression, and logistic regression), because it can simulate any other encoder given sufficient data.
   - **Target Encoder**: The paper explains why the Target Encoder and its variants are particularly suitable for tree-based models (such as random forests and gradient-boosted decision trees).
2. **Experimental verification**:
   - The paper conducted comprehensive computational experiments to evaluate the performance of 14 different encoders on 8 common machine-learning models, using 28 different datasets.
   - The experimental results are consistent with the theoretical analysis, further supporting the superiority of the One-Hot Encoder in ATI models and the effectiveness of the Target Encoder in tree-based models.

### Research background:

- **Importance of categorical variables**: Categorical variables such as gender, education level, and city are common in many practical problems. They must be encoded into numerical form before most machine-learning models can process them.
- **Challenge of encoder selection**: The choice of encoder has a large impact on model performance, so selecting an appropriate encoder is an important practical problem.

### Theoretical analysis:

1. **One-Hot Encoder as a universal encoder**:
   - For models performing affine transformations (ATI models), the paper proves that the One-Hot Encoder can simulate any other encoder by learning appropriate weights.
   - Formula representation: Let \( \phi \) be an arbitrary encoder, \( \phi_{\text{OH}} \) be the One-Hot Encoder, and \( W_\phi \) and \( W_{\text{OH}} \) be the corresponding weight matrices. Then there exists \( W_{\text{OH}} \) such that:
     \[ W_{\text{OH}} \phi_{\text{OH}}(x_1) = W_\phi \phi(x_1) \quad \forall x_1 \]
2. **Advantages of the Target Encoder in tree-based models**:
   - The Target Encoder encodes categorical variables by estimating the conditional mean of the target variable for each category level.
   - For tree-based models, the Target Encoder can retain the optimal split points, thereby improving model performance.
   - Formula representation: Let \( \phi_M \) be the Target Encoder, and let \( \phi_M(v_i) \) denote the average value of the target variable \( y \) for the category level \( v_i \). Then:
     \[ \phi_M(v_i) = \frac{1}{|D_{v_i}|} \sum_{(x, y) \in D_{v_i}} y \]
     where \( D_{v_i} \) is the set of training samples with the category level \( v_i \).

### Experimental results:

- **Impact of data sufficiency**: Through experiments on synthetic data and natural datasets, the paper shows the impact of data sufficiency (ASPL) on encoder performance. As ASPL increases, the performance of the One-Hot Encoder and the Target Encoder gradually approaches that of the optimal encoder.
- **Performance of different models**: The experimental results show that the One-Hot Encoder performs best in ATI models, while the Target Encoder performs best in tree-based models.

### Conclusion:

The paper through theoretical analysis
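
The two formulas in the theoretical analysis can be checked numerically on a small example. The sketch below is an illustration only, not the paper's proof or code: the toy data, the weight values in `W_phi`, and the helper names `target_encode` and `one_hot` are all assumptions made here. It computes the target encoding \( \phi_M \) as per-level means, then builds a one-hot weight matrix whose columns are \( W_\phi \phi(v_i) \), which is one way to see why a suitable \( W_{\text{OH}} \) always exists.

```python
import numpy as np

# Toy training set (hypothetical): one categorical feature and a numeric target.
levels = ["a", "b", "c"]
x = np.array(["a", "b", "a", "c", "b", "a"])
y = np.array([1.0, 0.0, 3.0, 5.0, 2.0, 2.0])


def target_encode(x, y, levels):
    """Mean of y over the training samples of each level, i.e. phi_M(v_i)."""
    return {v: float(y[x == v].mean()) for v in levels}


def one_hot(value, levels):
    """Column vector with a 1 at the index of `value` and 0 elsewhere."""
    vec = np.zeros((len(levels), 1))
    vec[levels.index(value), 0] = 1.0
    return vec


phi_M = target_encode(x, y, levels)

# Arbitrary first-layer weights acting on the 1-dimensional target encoding.
W_phi = np.array([[0.5], [-2.0]])  # shape (2, 1)

# Column i of W_OH is W_phi * phi(v_i), so multiplying W_OH by the one-hot
# vector of v_i selects exactly that column.
W_OH = np.hstack([W_phi * phi_M[v] for v in levels])  # shape (2, 3)

for v in levels:
    lhs = W_OH @ one_hot(v, levels)  # W_OH phi_OH(v)
    rhs = W_phi * phi_M[v]           # W_phi phi(v)
    assert np.allclose(lhs, rhs)
```

Here the weights are constructed by hand; in the setting the summary describes, an ATI model would have to learn comparable weights from data, which is where data sufficiency comes in.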

Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks
