Abstract: Machine Learning (ML) has become pivotal across various fields, offering innovative solutions to complex data challenges. Practitioners typically seek models that excel in both performance and reliability, aiming for optimal generalization on future data. To that end, a variety of methods such as dummy coding, up/down-sampling, and bin-counting have been explored. However, finding a solution that effectively handles limited and complex datasets remains a challenge. This study introduces the K-Means Featurizer (KMF), an algorithm designed to enhance model performance and reliability, especially in scenarios involving complex and limited datasets. KMF employs K-Means clustering to generate enriched features that provide a nuanced understanding of the data, balancing the similarity between the target variable and the feature space. This makes the predictive task more efficient by minimizing Euclidean distances and improves model generalizability. Our research validates KMF's effectiveness through an experiment in geoscience engineering, focusing on the prediction of hydraulic conductivity (K), a vital parameter in well monitoring and infrastructure planning. Traditionally, obtaining K is laborious and costly, requiring extensive pumping tests. KMF's application in this context demonstrates its potential to substantially reduce data losses during such operations. Applying KMF to Extreme Gradient Boosting, Random Forest, K-Neighbors, Support Vector Machine, and Multilayer Neural Network models resulted in a significant improvement in prediction accuracy, with K prediction scores reaching up to 90%. While our experiment centers on geoscience engineering, KMF's utility extends to other domains facing similar data intricacies. Its adaptability to different types of complex datasets positions it as a valuable tool for diverse data-driven applications.
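The abstract does not spell out KMF's implementation, so the following is only a minimal sketch of one common way to build cluster-based features that weigh the target against the feature space. It assumes the scaled target is appended as an extra coordinate while fitting K-Means, and that the target coordinate is then dropped so that unseen samples can be assigned to a cluster from their features alone. The names `fit_kmeans_featurizer`, `transform`, and `target_scale` are hypothetical and not taken from the paper, and the data is a toy example.

```python
import math
import random


def _nearest(point, centroids):
    """Index of the centroid closest to `point` in Euclidean distance."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def _kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[_nearest(p, centroids)].append(p)
        updated = [
            # Mean of each coordinate; an empty cluster keeps its old centroid.
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fit_kmeans_featurizer(X, y, k, target_scale=1.0, seed=0):
    """Cluster on [features, target_scale * target], then drop the target axis.

    target_scale (an assumed knob) controls how strongly the target
    influences the cluster boundaries relative to the features.
    """
    augmented = [list(x) + [target_scale * t] for x, t in zip(X, y)]
    centroids = _kmeans(augmented, k, seed=seed)
    return [c[:-1] for c in centroids]


def transform(X, centroids):
    """Append the ID of the nearest centroid to each feature row."""
    return [list(x) + [_nearest(x, centroids)] for x in X]


# Toy data: two well-separated groups with different target levels.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]

centroids = fit_kmeans_featurizer(X, y, k=2, target_scale=1.0)
features = transform(X, centroids)
```

Each row of `features` is the original row plus a cluster ID, which a downstream model can consume directly or after one-hot encoding. The target is used only at fit time, so `transform` can be applied to new samples whose K value is unknown.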

K-Means Featurizer: A booster for intricate datasets
