Abstract: In binary classification tasks, accurate probabilistic predictions are essential for real-world applications such as predicting payment defaults or assessing medical risks. The model must therefore be well calibrated, so that predicted probabilities align with actual outcomes. However, when score heterogeneity deviates from the underlying data probability distribution, traditional calibration metrics lose reliability and fail to align the score distribution with the actual probabilities. In this study, we highlight approaches that prioritize the alignment between predicted scores and the true probability distribution over the minimization of traditional performance or calibration metrics. For tree-based models such as Random Forest and XGBoost, our analysis emphasizes the flexibility these models offer: their hyperparameters can be tuned to minimize the Kullback-Leibler (KL) divergence between the predicted and true distributions. Through extensive empirical analysis across 10 UCI datasets and simulations, we demonstrate that optimizing tree-based models based on KL divergence yields superior alignment between predicted scores and actual probabilities without significant performance loss. In real-world scenarios, where the true probabilities are not observed, the reference probability distribution is determined a priori as a Beta distribution estimated through maximum likelihood. By contrast, minimizing traditional calibration metrics may lead to suboptimal results, characterized by notable performance declines and inferior KL values. Our findings reveal limitations in traditional calibration metrics, which could undermine the reliability of predictive models for critical decision-making.
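
To make the KL criterion concrete, the following is a minimal sketch, not the authors' implementation, of how the divergence between a reference Beta distribution and a set of predicted scores could be computed. It assumes the Beta parameters `a` and `b` have already been estimated, discretizes both distributions over equal-width bins on [0, 1], and measures KL from the reference to the scores. The function names, the number of bins, the smoothing constant, and the direction of the divergence are all illustrative assumptions.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def reference_bin_masses(a, b, n_bins):
    """Approximate Beta(a, b) mass per equal-width bin (midpoint rule), renormalized.

    The midpoint rule is a rough approximation (an assumption of this sketch);
    it is adequate for a > 1 and b > 1, where the density is bounded.
    """
    width = 1.0 / n_bins
    masses = [beta_pdf((i + 0.5) * width, a, b) * width for i in range(n_bins)]
    total = sum(masses)
    return [m / total for m in masses]


def score_bin_masses(scores, n_bins, eps=1e-6):
    """Share of predicted scores per bin, smoothed so that no bin is exactly zero."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]


def kl_divergence(p, q):
    """Discrete KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example: the reference is Beta(2, 5); one model produces scores
# distributed like the reference, another produces polarized (U-shaped) scores.
random.seed(0)
n_bins = 20
reference = reference_bin_masses(2.0, 5.0, n_bins)
matched_scores = [random.betavariate(2.0, 5.0) for _ in range(5000)]
polarized_scores = [random.betavariate(0.5, 0.5) for _ in range(5000)]

kl_matched = kl_divergence(reference, score_bin_masses(matched_scores, n_bins))
kl_polarized = kl_divergence(reference, score_bin_masses(polarized_scores, n_bins))
print(f"KL, matched scores:   {kl_matched:.4f}")
print(f"KL, polarized scores: {kl_polarized:.4f}")
```

In a hyperparameter search, a quantity like `kl_matched` would be computed for each candidate configuration of the tree-based model, and the configuration with the smallest value retained. How this is traded off against predictive performance is a design choice this sketch does not cover.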
