Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

Florian Pargent, Florian Pfisterer, Janek Thomas, Bernd Bischl

DOI: https://doi.org/10.1007/s00180-022-01207-6

2022-03-04

Abstract: Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis. A common problem is high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm's predictive performance, and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.

Machine Learning

What problem does this paper attempt to address?

The problem this paper addresses is how to efficiently encode high-cardinality features (i.e., unordered categorical variables with a large number of distinct levels) in supervised machine learning. Specifically, the authors study how different encoding techniques affect the predictive performance of subsequent machine learning algorithms and attempt to derive best practices on when to use which technique. The paper focuses particularly on regularized target encoding and compares it with other encoding methods, such as integer encoding, frequency encoding, hash encoding, leaf encoding, impact encoding, and generalized linear mixed model (GLMM) encoding. Through large-scale benchmark experiments with five ML algorithms, the authors evaluate the performance of these encoding techniques in regression, binary classification, and multiclass classification settings, aiming to provide effective solutions for handling high-cardinality categorical features.
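To make the idea of regularized target encoding concrete, the following is a minimal sketch, not the authors' implementation. It assumes a numeric target, uses out-of-fold level means shrunk toward the global mean, and relies on hypothetical names (`fit_level_means`, `cross_fitted_target_encode`, `smoothing`, `n_folds`) introduced only for illustration. The paper's own regularized encoders may differ in how they estimate and regularize the per-level predictions.

```python
from collections import defaultdict


def fit_level_means(levels, targets, smoothing):
    """Return (global_mean, {level: shrunken mean}) for one training split."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    # Shrink each level mean toward the global mean; rare levels shrink most.
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return global_mean, encoding


def cross_fitted_target_encode(levels, targets, n_folds=5, smoothing=10.0):
    """Encode each training row using statistics from the other folds only.

    Assumes n_folds >= 2 and that rows are already in random order, because
    folds are assigned by row index modulo n_folds (an illustrative shortcut).
    """
    n = len(levels)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        global_mean, encoding = fit_level_means(
            [levels[i] for i in rest], [targets[i] for i in rest], smoothing
        )
        for i in held_out:
            # Levels unseen in the other folds fall back to the global mean.
            encoded[i] = encoding.get(levels[i], global_mean)
    return encoded


# Toy example: one categorical feature with three levels and a numeric target.
levels = ["a", "a", "b", "b", "c", "a"]
targets = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
encoded = cross_fitted_target_encode(levels, targets, n_folds=3, smoothing=1.0)
```

Because each row is encoded without access to its own target value, the new numeric feature leaks less target information into the downstream learner than a plain per-level mean would. At prediction time, one would typically refit `fit_level_means` on the full training set and apply that mapping to new data.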

Encoding high-cardinality string categorical variables

Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks

Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems

Fairness Implications of Encoding Protected Categorical Attributes

Learning over Categorical Data Using Counting Features

Beyond one-hot encoding: Lower dimensional target embedding

A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables

Sufficient Representations for Categorical Variables

Feature Encodings for Gradient Boosting with Automunge

Variance-Covariance Regularization Improves Representation Learning

Improving deep representation learning via auxiliary learnable target coding

When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications?

Out of (the) bag—encoding categorical predictors impacts out-of-bag samples

Feature Encoding Methods Evaluation Based On Multiple Kernel Learning

An attribute-weighted isometric embedding method for categorical encoding on mixed data

A Conditional-Probability Zone Transformation Coding Method for Categorical Features

End-to-End Feature-Aware Label Space Encoding for Multilabel Classification with Many Classes

Target Variable Engineering

A Random-effects Approach to Regression Involving Many Categorical Predictors and Their Interactions