Abstract: In this work, we propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors in practical settings. Improving the calibration of classifiers is crucial for enhancing the trustworthiness and interpretability of machine learning models, especially in sensitive decision-making scenarios. Although various calibration (error) estimators exist in the current literature, there is a lack of guidance on selecting the appropriate estimator and tuning its hyperparameters. By leveraging the bilinear structure of squared calibration errors, we reformulate calibration estimation as a regression problem with independent and identically distributed (i.i.d.) input pairs. This reformulation allows us to quantify the performance of different estimators even for the most challenging calibration criterion, known as canonical calibration. Our approach advocates for a training-validation-testing pipeline when estimating a calibration error on an evaluation dataset. We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on standard image classification tasks.
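
The reformulation mentioned in the abstract can be sketched as follows. The notation is an assumption made for illustration ($f$ is the classifier, $Y$ the one-hot label, $(X, Y)$ and $(X', Y')$ two i.i.d. samples, $h$ a candidate function of two predictions), and the paper's own definitions may differ in detail. The squared $L_2$ calibration error under canonical calibration is

$$
\mathrm{CE}_2^2(f) = \mathbb{E}\left[\left\lVert \mathbb{E}[Y \mid f(X)] - f(X) \right\rVert_2^2\right].
$$

Because the two samples are independent and the inner product is bilinear,

$$
\mathbb{E}\left[\langle Y - f(X),\, Y' - f(X') \rangle \mid f(X), f(X')\right]
= \left\langle \mathbb{E}[Y \mid f(X)] - f(X),\; \mathbb{E}[Y' \mid f(X')] - f(X') \right\rangle
=: h^*\big(f(X), f(X')\big).
$$

A conditional expectation minimizes the mean-squared error, so $h^*$ minimizes

$$
R(h) = \mathbb{E}\left[\Big(\langle Y - f(X),\, Y' - f(X') \rangle - h\big(f(X), f(X')\big)\Big)^2\right],
$$

and evaluating it on the diagonal recovers the calibration error: $\mathrm{CE}_2^2(f) = \mathbb{E}\left[h^*\big(f(X), f(X)\big)\right]$. Under this reading, any estimator that implies a function $h$ can be scored by $R(h)$ on held-out pairs.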

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the selection and optimization of calibration error estimators for classification models. Specifically, the authors propose a risk function based on the mean-squared error that can be used to compare and optimize different calibration error estimators in practical settings. The main problems and goals of the paper are:

1. **Improve the reliability of classification models**:
   - Many modern classifiers (such as deep neural networks) tend to be over-confident in their predictions, which undermines the reliability and interpretability of the model. Reliable predictions are particularly important in sensitive decision-making scenarios such as medicine, autonomous driving, weather forecasting, and financial decision-making.
2. **Quantify the calibration error of the model**:
   - Calibration error measures the deviation between the model's predicted probabilities and the observed outcomes. Existing calibration error estimators are often biased or inconsistent, lack theoretical derivation, and are difficult to interpret.
3. **Select and optimize calibration error estimators**:
   - The current literature offers multiple calibration error estimators but little guidance on how to select an appropriate estimator and its hyperparameters. The paper proposes a new risk function that can be used to select and optimize different calibration error estimators.
4. **Handle complex calibration standards**:
   - Canonical calibration is a very strict calibration standard for which existing methods struggle to produce effective estimates. The paper proposes a new solution by transforming the calibration estimation problem into a regression problem with independent and identically distributed (i.i.d.) input pairs.

### Main contributions

1. **Propose a new risk function**:
   - A risk function based on the mean-squared error is proposed for estimators of squared calibration errors. It can be used to compare and select estimators in practical settings and applies to all types of calibration standards, including canonical calibration.
2. **Construct a calibration evaluation pipeline**:
   - Based on the proposed risk function, a training-validation-testing pipeline is constructed to estimate the calibration error on the evaluation dataset. This pipeline can also optimize existing calibration error estimators.
3. **Introduce a new kernel ridge regression estimator**:
   - A new calibration error estimator based on kernel ridge regression is proposed and compared with optimized baseline estimators on common image classification tasks.

### Background knowledge

The paper covers the necessary background, including the basics of measure theory, mean-squared error risk minimization, the concept of calibration, and different calibration standards. In particular, it discusses the definitions and estimation methods of canonical calibration, top-label confidence calibration, and other calibration standards.

### Methods

1. **Definition and properties of the risk function**:
   - A new calibration-estimation risk function is defined, and it is proved that this risk function identifies the correct solution.
2. **Calibration estimation under limited data**:
   - An unbiased and consistent risk estimator is proposed, and a training-validation-testing pipeline similar to that of traditional machine learning models is constructed to select and optimize calibration error estimators.
3. **New kernel ridge regression estimators**:
   - Two new calibration error estimators based on kernel ridge regression are introduced, and closed-form solutions are provided.

### Experiments

The paper evaluates the proposed method through simulation experiments and real image classification tasks. The experimental results show that the new kernel ridge regression estimator outperforms the existing baseline estimators.

### Conclusions

The paper addresses the selection and optimization of calibration error estimators in classification models by providing a new risk function and evaluation pipeline, which supports improving the reliability and interpretability of the model.
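
The training-validation-testing idea summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the task is a synthetic binary problem with $P(Y = 1 \mid p) = p^2$ (so the true squared calibration error is $1/30$), the estimator is a simple equal-width histogram rather than kernel ridge regression, and the helper names (`bin_index`, `fit_binned_gap`, `pairwise_risk`, `select_and_estimate`) are hypothetical. The number of bins is chosen by a mean-squared error risk computed over pairs of distinct validation points.

```python
import numpy as np

def bin_index(p, n_bins):
    """Map probabilities in [0, 1] to equal-width bin indices 0..n_bins-1."""
    return np.minimum((p * n_bins).astype(int), n_bins - 1)

def fit_binned_gap(p_train, y_train, n_bins):
    """Histogram estimate of g(p) = E[Y | p] - p: mean residual per bin."""
    idx = bin_index(p_train, n_bins)
    sums = np.bincount(idx, weights=y_train - p_train, minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    # Empty bins are assigned a gap of zero.
    return np.divide(sums, counts, out=np.zeros(n_bins), where=counts > 0)

def pairwise_risk(p_val, y_val, gaps):
    """Mean squared error of h(p, p') = g(p) * g(p') against the products
    (y - p) * (y' - p'), averaged over all pairs of distinct validation points."""
    r = y_val - p_val
    g = gaps[bin_index(p_val, len(gaps))]
    sq_err = (np.outer(r, r) - np.outer(g, g)) ** 2
    n = len(r)
    # Drop the diagonal so that only pairs of different samples are counted.
    return (sq_err.sum() - np.trace(sq_err)) / (n * (n - 1))

def sample(rng, n):
    """Synthetic binary task (an assumption): P(Y = 1 | p) = p**2."""
    p = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(0.0, 1.0, size=n) < p ** 2).astype(float)
    return p, y

def select_and_estimate(seed=0, candidates=(2, 5, 10, 20, 50)):
    rng = np.random.default_rng(seed)
    p_tr, y_tr = sample(rng, 4000)   # fit each candidate estimator
    p_va, y_va = sample(rng, 1000)   # score candidates with the pairwise risk
    p_te, _ = sample(rng, 4000)      # report the final estimate
    fitted = {b: fit_binned_gap(p_tr, y_tr, b) for b in candidates}
    risks = {b: pairwise_risk(p_va, y_va, fitted[b]) for b in candidates}
    best = min(risks, key=risks.get)
    # Plug-in estimate on the test split: average of h(p, p) = g(p)**2.
    g_te = fitted[best][bin_index(p_te, best)]
    return best, float(np.mean(g_te ** 2)), risks

best_bins, ce_hat, risks = select_and_estimate()
print("selected bins:", best_bins, "estimated squared calibration error:", ce_hat)
```

The validation risk contains a large irreducible noise term, so differences between candidates are small relative to its absolute value. Only the ranking of the candidates matters for selection.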

Optimizing Estimators of Squared Calibration Errors in Classification
