Using Platt's scaling for calibration after undersampling -- limitations and how to address them

Nathan Phelps, Daniel J. Lizotte, Douglas G. Woolford

2024-10-23

Abstract: When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt's scaling. Here, a logistic regression model is used to model the relationship between the base model's original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt's scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt's scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt's scaling should not be used for calibration after undersampling without critical thought. If Platt's scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt's scaling might be appropriate for calibration after undersampling. If this is not the case, we recommend a modified version of Platt's scaling that fits a logistic generalized additive model to the logit of the base model's predictions, as it is both theoretically motivated and performed well across the settings considered in our study.
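
A short worked derivation may help make the abstract's argument concrete. It is a sketch of the standard prior-shift result, not necessarily the authors' own derivation, and it rests on assumptions the abstract does not spell out: every minority-class observation is kept, each majority-class observation is retained independently with probability $\beta$, $p$ is the probability of a positive response in the full data, and $p_s$ is the corresponding probability in the undersampled data (what a base model calibrated on the undersampled data would predict). The symbols $\beta$, $p$, $p_s$, $a$, and $b$ are introduced here for illustration.

```latex
\begin{align*}
\text{Platt's scaling on raw predictions:}\quad
  & \hat{p} = \frac{1}{1 + \exp\{-(a + b\,p_s)\}} \\[4pt]
\text{Odds after undersampling:}\quad
  & \frac{p_s}{1 - p_s} = \frac{1}{\beta}\cdot\frac{p}{1 - p} \\[4pt]
\text{Hence:}\quad
  & \operatorname{logit}(p) = \operatorname{logit}(p_s) + \log\beta,
  \qquad
  p = \frac{\beta\,p_s}{\beta\,p_s + 1 - p_s}
\end{align*}
```

Under these assumptions the correct map is linear in $\operatorname{logit}(p_s)$, with slope 1 and intercept $\log\beta$, but it is not linear in $p_s$ itself. This is one way to see why a logistic regression on the raw predictions can be misspecified. When the base model is not calibrated on the undersampled data, the relationship need not be linear even on the logit scale, which is consistent with the abstract's recommendation of a more flexible logistic generalized additive model.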

Methodology, Machine Learning

What problem does this paper attempt to address?

The paper asks how well Platt's scaling calibrates a model trained on undersampled data when the binary response is highly imbalanced, and where the method breaks down.

### Problem Background

Highly imbalanced binary classification problems arise in many applications, such as finance, medical care, and wildfire prediction. In these problems the positive (minority) class has far fewer observations than the negative (majority) class. A common remedy is undersampling: only a random subset of the majority class is retained, which gives a more balanced training set. The resulting model's predicted probabilities are severely biased, however, because the class distribution of the training data differs from that of the data the model will be applied to.

### Core Problems

1. **Applicability of Platt's scaling**:
   - Platt's scaling is a commonly used calibration method that adjusts a model's predicted probabilities by fitting a logistic regression to them. It was not originally designed for calibrating models after undersampling.
   - The paper analyzes how well Platt's scaling calibrates models trained on undersampled data and shows that it fails to give accurate probability estimates in some cases.
2. **Improvement methods**:
   - The paper proposes a modified Platt's scaling that fits a logistic generalized additive model (GAM) to the logit of the base model's predictions, so that non-linear relationships can be captured.
   - The paper also explores applying a logit transformation to the original predictions before using Platt's scaling.

### Research Contributions

- **Theoretical analysis**: The paper shows analytically that Platt's scaling cannot provide correct calibration in some cases (such as when the model perfectly fits the undersampled data).
- **Experimental verification**: Through a simulation study and a case study, the paper compares the calibration methods and gives guidance on choosing among them in practice.

### Summary

The paper evaluates Platt's scaling as a calibration method after undersampling and proposes modified versions of it. The results indicate that standard Platt's scaling is not always suitable for this purpose, and that the modified methods can substantially improve calibration in those cases.
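
The difference between applying Platt's scaling to raw predictions and to their logit can be illustrated with a small, noise-free sketch. It is not the paper's simulation study. It assumes an idealised base model that is perfectly calibrated on the undersampled data, a hypothetical majority-class retention probability `BETA`, and the standard prior-shift correction for random undersampling. The helper `fit_platt` is written for this illustration and fits the logistic regression by Newton-Raphson against expected (fractional) outcomes, so no sampling noise is involved.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, targets, n_iter=50):
    """Fit expit(a + b * score) by Newton-Raphson.

    Targets may be fractional (expected outcomes), which gives a
    population-level fit without sampling noise.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for s, y in zip(scores, targets):
            mu = expit(a + b * s)
            w = mu * (1.0 - mu)
            g0 += y - mu          # gradient w.r.t. intercept
            g1 += (y - mu) * s    # gradient w.r.t. slope
            h00 += w              # entries of the 2x2 information matrix
            h01 += w * s
            h11 += w * s * s
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


BETA = 0.1  # hypothetical retention probability for the majority class

# Predictions of an idealised base model that is perfectly calibrated
# on the undersampled data (an assumption made for this illustration).
p_s = [i / 100 for i in range(5, 96)]
# Corresponding full-data probabilities (standard prior-shift correction).
p_true = [BETA * q / (BETA * q + 1.0 - q) for q in p_s]

# Platt's scaling on the raw predictions.
a_raw, b_raw = fit_platt(p_s, p_true)
err_raw = max(abs(expit(a_raw + b_raw * q) - t) for q, t in zip(p_s, p_true))

# Platt's scaling on the logit of the predictions.
z = [logit(q) for q in p_s]
a_lgt, b_lgt = fit_platt(z, p_true)
err_lgt = max(abs(expit(a_lgt + b_lgt * s) - t) for s, t in zip(z, p_true))

print(f"raw scale:   max abs calibration error = {err_raw:.4f}")
print(f"logit scale: intercept = {a_lgt:.4f}, slope = {b_lgt:.4f}, "
      f"max abs calibration error = {err_lgt:.2e}")
```

In this idealised setting the logit-scale fit recovers an intercept of log(`BETA`) and a slope of 1, while the raw-scale fit leaves a systematic calibration error because the true relationship is not linear in the raw predictions. If the base model is not calibrated on the undersampled data, neither linear form is guaranteed to work. That is the case in which the paper recommends a logistic GAM on the logit of the predictions, an extension not shown here.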

Online Platt Scaling with Calibeating

Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk

Calibration of Machine Learning Classifiers for Probability of Default Modelling

A constrained maximum likelihood approach to developing well-calibrated models for predicting binary outcomes

Conformalized Survival Distributions: A Generic Post-Process to Increase Calibration

Calibration methods in imbalanced binary classification

Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control

Towards Fair and Calibrated Models

Scaling of Class-wise Training Losses for Post-hoc Calibration

Obtaining Calibrated Probabilities from Boosting

Calibrated Model Criticism Using Split Predictive Checks

Risk prediction models for discrete ordinal outcomes: calibration and the impact of the proportional odds assumption

Training with Scaled Logits to Alleviate Class-level Over-fitting in Few-shot Learning

A Hitchhiker's Guide to Scaling Law Estimation

Scaling up Data Augmentation MCMC via Calibration

Calibrating Where It Matters: Constrained Temperature Scaling

On an improvement of LASSO by scaling

On Calibrating Semantic Segmentation Models: Analyses and An Algorithm

Fair admission risk prediction with proportional multicalibration

On the Limitations of Temperature Scaling for Distributions with Overlaps