Ahmed M. Alaa, Mihaela van der Schaar
Abstract: Deep learning models achieve high predictive accuracy across a broad spectrum of tasks, but rigorously quantifying their predictive uncertainty remains challenging. Usable estimates of predictive uncertainty should (1) cover the true prediction targets with high probability, and (2) discriminate between high- and low-confidence prediction instances. Existing methods for uncertainty quantification are based predominantly on Bayesian neural networks; these may fall short of (1) and (2) -- i.e., Bayesian credible intervals do not guarantee frequentist coverage, and approximate posterior inference undermines discriminative accuracy. In this paper, we develop the discriminative jackknife (DJ), a frequentist procedure that utilizes influence functions of a model's loss functional to construct a jackknife (or leave-one-out) estimator of predictive confidence intervals. The DJ satisfies (1) and (2), is applicable to a wide range of deep learning models, is easy to implement, and can be applied in a post-hoc fashion without interfering with model training or compromising its accuracy. Experiments demonstrate that DJ performs competitively compared to existing Bayesian and non-Bayesian regression baselines.
What problem does this paper attempt to address?
This paper addresses the challenge of uncertainty quantification in deep learning models. The authors point out that although deep learning models achieve high predictive accuracy across many tasks, rigorously quantifying their predictive uncertainty remains difficult. To be usable in practice, uncertainty estimates should meet two key requirements:
1. **Coverage**: Cover the true prediction target with high probability.
2. **Discrimination**: Be able to distinguish between high-confidence and low-confidence prediction instances.
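As a rough illustration of how the two requirements might be checked on held-out data, the sketch below computes the empirical coverage of a set of intervals and the correlation between interval width and absolute error. The function names and the toy numbers are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def empirical_coverage(y, lower, upper):
    """Fraction of targets that fall inside their intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

def width_error_correlation(y, pred, lower, upper):
    """Pearson correlation between interval width and absolute error.

    A positive value suggests that wider intervals tend to accompany
    larger errors, which is the intuition behind discrimination.
    """
    widths = np.asarray(upper) - np.asarray(lower)
    abs_err = np.abs(np.asarray(y) - np.asarray(pred))
    return float(np.corrcoef(widths, abs_err)[0, 1])

# Toy data, invented purely for illustration.
y = np.array([1.0, 2.0, 3.0, 4.0])            # true targets
pred = np.array([1.1, 2.0, 3.5, 5.5])         # point predictions
half_width = np.array([0.2, 0.2, 0.6, 1.0])   # per-instance half-widths
lower, upper = pred - half_width, pred + half_width

coverage = empirical_coverage(y, lower, upper)        # 3 of 4 targets covered: 0.75
corr = width_error_correlation(y, pred, lower, upper)
print(f"coverage={coverage:.2f}, width/error correlation={corr:.2f}")
```

In this toy case the fourth target falls outside its interval, so the empirical coverage is 0.75, and the widest intervals belong to the instances with the largest errors.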
Existing uncertainty quantification methods are mainly based on Bayesian neural networks (BNNs), and these methods may not fully meet the above two requirements. For example, Bayesian credible intervals do not guarantee frequentist coverage, and approximate posterior inference can weaken discriminative performance. In addition, Bayesian methods require significant modifications to the training process, and exact Bayesian inference is too computationally expensive for practical applications.
To solve these problems, the authors propose the **Discriminative Jackknife (DJ)**, a frequentist method that uses influence functions of the model's loss functional to construct a jackknife (or leave-one-out) estimator of predictive confidence intervals. The DJ method has the following advantages:
- Meets the requirements of coverage and discrimination.
- Is applicable to a wide range of deep learning models.
- Is simple to implement and can be applied post hoc without interfering with model training or compromising its accuracy.
In the experiments, the DJ method performs well in terms of coverage and discrimination and is competitive with existing Bayesian and non-Bayesian regression baselines.
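To make the leave-one-out idea concrete, here is a minimal sketch of a classical jackknife interval for a small linear regression. It is not the DJ procedure itself: it refits the model exactly for each held-out point instead of approximating those refits with influence functions, and it uses a plain empirical quantile of the leave-one-out residuals. The helper names (`fit_linear`, `predict_linear`, `jackknife_interval`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def predict_linear(theta, X):
    """Predictions of the fitted linear model for the rows of X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ theta

def jackknife_interval(X, y, X_new, alpha=0.1):
    """Classical jackknife interval built from exact leave-one-out refits."""
    n = len(X)
    loo_residuals = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                   # drop the i-th training point
        theta_i = fit_linear(X[keep], y[keep])     # exact refit without it
        loo_residuals[i] = abs(y[i] - predict_linear(theta_i, X[i:i + 1])[0])
    # Simplification: plain (1 - alpha) empirical quantile of the residuals.
    q = np.quantile(loo_residuals, 1 - alpha)
    center = predict_linear(fit_linear(X, y), X_new)
    return center - q, center + q

# Synthetic data, invented for illustration: y = 2x + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)
X_new = np.array([[0.0], [0.5]])

lower, upper = jackknife_interval(X, y, X_new, alpha=0.1)
print(np.column_stack([lower, upper]))
```

Intervals built this way have the same width at every test point, so they cannot separate high-confidence from low-confidence predictions on their own. Exact refitting is also impractical for deep networks, which is where the influence-function approximation described above comes in.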
### Summary of Mathematical Formulas
1. **Definition of Confidence Interval**:
\[
C(x; \hat{\theta}) \triangleq [f^-(x; \hat{\theta}), f^+(x; \hat{\theta})], \quad \forall x \in X
\]
where \(W(C(x; \hat{\theta})) = f^+(x; \hat{\theta}) - f^-(x; \hat{\theta})\) represents the width of the confidence interval.
2. **Requirement for Frequentist Coverage**:
\[
P\{y \in C(x; \hat{\theta})\} \geq 1 - \alpha
\]
3. **Requirement for Discrimination**:
\[
E[W(C(x; \hat{\theta}))] \geq E[W(C(x'; \hat{\theta}))] \Leftrightarrow E[\ell(y, f(x; \hat{\theta}))] \geq E[\ell(y', f(x'; \hat{\theta}))]
\]
4. **Construction of DJ Confidence Interval**:
\[
\hat{C}_{DJ}^\alpha(x; \hat{\theta}) = [f^-(x; \hat{\theta}), f^+(x; \hat{\theta})]
\]
where
\[
f^\gamma(x; \hat{\theta}) = G_{\alpha, \gamma}(R, V(x)), \quad \gamma \in \{-1, +1\}
\]
Here \(R\) denotes the marginal error and \(V(x)\) the local variability.
5. **Definition of Influence Function**:
\[
I^{(1)}_\theta(x_i, y_i) = \left. \frac{\partial \hat{