Abstract: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively unexplored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples: given $n$ draws from a distribution $\mathcal{D}$ on (predictions, binary outcomes), our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$ and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem that improve upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing that the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.
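For reference, the empirical smooth calibration linear program mentioned in the abstract can be written roughly as follows. This is a sketch that assumes the standard definition of smooth calibration error as a supremum over 1-Lipschitz witness functions $w: [0, 1]\to[-1, 1]$; the sample notation $(v_i, y_i)$ and the symbol $\widehat{\mathcal{D}}_n$ for the empirical distribution are introduced here for illustration and are not quoted from the abstract.

$$\mathrm{smCE}(\widehat{\mathcal{D}}_n)=\max_{w\in[-1,1]^n}\ \frac{1}{n}\sum_{i=1}^{n}(y_i - v_i)\,w_i\quad\text{subject to}\quad |w_i - w_j|\leq|v_i - v_j|\ \text{ for all } i, j\in\{1,\dots,n\}.$$

If the samples are sorted so that $v_1\leq\dots\leq v_n$, the constraints on adjacent pairs $(i, i+1)$ already imply all the others by the triangle inequality, which leaves a chain-structured program.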

What problem does this paper attempt to address?

The paper addresses the problem of testing model calibration in nearly linear time. Specifically, it asks how to distinguish, from samples, whether a distribution $\mathcal{D}$ is perfectly calibrated or $\varepsilon$-far from calibration. Calibration is an important statistical property in the machine learning and decision-making literature, especially for the outputs of binary prediction models, yet the algorithmic aspects of measuring it have been relatively unexplored. The authors therefore define the calibration testing problem through the lens of property testing and design a nearly-linear-time algorithm to solve it.

### Core Problems of the Paper

- **Calibration Testing Problem**: Given $n$ independent and identically distributed samples $(v, y)$ drawn from a distribution $\mathcal{D}$, where $v$ is the predicted value and $y$ is the binary outcome, the goal is to distinguish between the following two cases:
  - $\mathcal{D}$ is perfectly calibrated, i.e., $d_{CE}(\mathcal{D})=0$.
  - $\mathcal{D}$ is $\varepsilon$-far from calibration, i.e., $d_{CE}(\mathcal{D})\geq\varepsilon$.

### Main Contributions of the Paper

1. **Proposing the Calibration Testing Problem**: The authors formalize the calibration testing problem and define an $\varepsilon$-calibration tester.
2. **Designing a Nearly-Linear-Time Algorithm**: The authors design an exact solver with running time $O(n\log^{2}(n))$ for the empirical smooth calibration linear program, which solves the $\varepsilon$-calibration testing problem. This is faster than black-box linear program solvers, which require $\Omega(n^{\omega})$ time, where $\omega > 2$ is the exponent of matrix multiplication.
3. **Handling the Tolerant Calibration Testing Problem**: The authors also design algorithms for tolerant variants of the testing problem, in which the "yes" case allows a small nonzero distance to calibration rather than requiring perfect calibration.
4. **Experimental Verification**: Experiments show that the testing problem faithfully captures standard notions of calibration and that the algorithms scale to large sample sizes.

### Core Concepts

- **Calibration**: A distribution $\mathcal{D}$ over (prediction, outcome) pairs is calibrated if, for all $t\in[0, 1]$, $\mathbb{E}_{(v, y)\sim\mathcal{D}}[y\mid v = t]=t$.
- **Lower Distance to Calibration (LDTC)**: Defined as $d_{CE}(\mathcal{D})=\inf_{\Pi\in\text{ext}(\mathcal{D})}\mathbb{E}_{(u, v, y)\sim\Pi}|u - v|$, where $\text{ext}(\mathcal{D})$ is the set of all joint distributions $\Pi$ over $(u, v, y)$ such that the marginal distribution of $(v, y)$ is $\mathcal{D}$ and the marginal distribution of $(u, y)$ is perfectly calibrated.
- **Smooth Calibration Error (smCE)**: Defined as $\text{smCE}(\mathcal{D})=\sup_{w\in W}\left|\mathbb{E}_{(v, y)\sim\mathcal{D}}[(y - v)w(v)]\right|$, where $W$ is the set of all 1-Lipschitz functions $w: [0, 1]\to[-1, 1]$.

### Conclusion

The paper provides a new way to evaluate and test model calibration by introducing the calibration testing problem and designing efficient algorithms for it. The algorithms come with theoretical guarantees and, in the reported experiments, scale to large sample sizes.
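To make the smooth calibration error concrete, here is a minimal Python sketch that estimates it on a sample. It is not the paper's exact $O(n\log^2(n))$ solver: it restricts the witness values to a uniform grid in $[-1, 1]$ and runs a simple dynamic program over the predictions in sorted order, so it returns (up to rounding) a lower bound on the empirical smCE. The function name `smooth_ce_grid` and the `grid` parameter are hypothetical, chosen only for this illustration.

```python
def smooth_ce_grid(preds, outcomes, grid=100):
    """Grid-restricted lower bound on the empirical smooth calibration error.

    Illustrative sketch only (an assumption of this example, not the paper's
    solver): witness values are limited to multiples of 1/grid in [-1, 1].
    """
    n = len(preds)
    if n == 0:
        return 0.0
    # Sort samples by prediction so the Lipschitz constraint only needs to be
    # enforced between neighbours.
    pairs = sorted(zip(preds, outcomes))
    # Objective coefficients c_i = (y_i - v_i) / n.
    coef = [(y - v) / n for v, y in pairs]
    levels = range(-grid, grid + 1)  # witness value w = k / grid
    # best[j] = best objective so far with the current witness at levels[j].
    best = [coef[0] * k / grid for k in levels]
    for i in range(1, n):
        gap = pairs[i][0] - pairs[i - 1][0]
        # Largest allowed change in grid steps: |w_i - w_{i-1}| <= gap.
        step = int(gap * grid + 1e-9)
        new = []
        for j, k in enumerate(levels):
            lo = max(0, j - step)
            hi = min(2 * grid, j + step)
            new.append(max(best[lo:hi + 1]) + coef[i] * k / grid)
        best = new
    # The witness class is symmetric (w allowed implies -w allowed), so the
    # maximum of the signed objective equals the maximum absolute value.
    return max(best)
```

For example, `smooth_ce_grid([0.5, 0.5], [0, 1])` evaluates to 0 because predicting 0.5 on one positive and one negative outcome is calibrated on that sample, while `smooth_ce_grid([0.0, 0.0], [1, 1])` evaluates to 1. The cost grows with both the sample size and the grid resolution, so this version is only suitable for small inputs.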

Testing Calibration in Nearly-Linear Time

Scheduling with Variable-Length Calibrations: Two Agreeable Variants.

Weighted Throughput Maximization with Calibrations.

A Unifying Theory of Distance from Calibration

Calibration Error for Decision Making

On Computationally Efficient Multi-Class Calibration

On the Distance from Calibration in Sequential Prediction

Reassessing How to Compare and Improve the Calibration of Machine Learning Models

Truthfulness of Calibration Measures

Calibration by Distribution Matching: Trainable Kernel Calibration Metrics

Calibrations Scheduling Problem with Arbitrary Lengths and Activation Length

Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control

Breaking the $T^{2/3}$ Barrier for Sequential Calibration

The Calibration Generalization Gap

Field-aware Calibration: A Simple and Empirically Strong Method for Reliable Probabilistic Predictions

Towards reliable predictive analytics: a generalized calibration framework

Stronger Calibration Lower Bounds via Sidestepping

Calibration through the Lens of Interpretability

Human-Aligned Calibration for AI-Assisted Decision Making

ForeCal: Random Forest-based Calibration for DNNs

Efficient Calibration for Imperfect Computer Models