Waleed A. Yousef, Issa Traore, William Briguglio
Abstract: This paper explores the calibration of a classifier output score in binary classification problems. A calibrator is a function that maps the arbitrary classifier score of a testing observation onto $[0,1]$ to provide an estimate of the posterior probability of belonging to one of the two classes. Calibration is important for two reasons: first, it provides a meaningful score, namely the posterior probability; second, it puts the scores of different classifiers on the same scale for comparable interpretation. The paper presents three main contributions: (1) Introducing multi-score calibration, when more than one classifier provides a score for a single observation. (2) Introducing the idea that the classifier scores fed to a calibration process are nothing but features to a classifier, hence proposing expanding the classifier scores to higher dimensions to boost the calibrator's performance. (3) Conducting a massive simulation study, on the order of 24,000 experiments, that incorporates different configurations, in addition to experimenting on two real datasets from the cybersecurity domain. The results show that there is no overall winner among the different calibrators and different configurations. However, general advice for practitioners includes the following: Platt's calibrator (Platt, 1999), a version of logistic regression that decreases bias for a small sample size, has a very stable and acceptable performance among all experiments; our suggested multi-score calibration provides better performance than single-score calibration in the majority of experiments, including the two real datasets. In addition, expanding the scores can help in some experiments.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper studies how to calibrate classifier output scores in binary classification. A calibration function maps the arbitrary score that a classifier assigns to a test observation onto the interval [0, 1], yielding an estimate of the posterior probability that the observation belongs to one of the two classes. Calibration matters for two reasons:
1. **Providing meaningful scores**: namely posterior probabilities, which makes the output of the classifier more interpretable.
2. **Making scores of different classifiers comparable**: mapping the scores of different classifiers onto the same scale lets them be compared and interpreted consistently.
The main contributions of the paper include:
1. **Multi-score calibration**: When multiple classifiers each provide a score for the same observation, the calibrator combines these scores into a single calibrated probability.
2. **Scores as features**: The paper draws an exact analogy between two tasks:
- Designing a classifier from a set of features.
- Designing a calibrator that produces a single calibrated score from the scores of different classifiers.
Because the scores play the role of features, the paper proposes expanding them to a higher dimension to improve the calibrator's performance.
3. **Large-scale simulation study**: Approximately 24,000 experiments were run across different configurations, along with experiments on two real datasets from the cybersecurity domain. The results show that no calibrator performs best in all cases, but Platt's calibrator shows very stable and acceptable performance in all experiments. In addition, the proposed multi-score calibration outperforms single-score calibration in most experiments, including the two real datasets. Expanding the scores also improves performance in some experiments.
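The first two contributions can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain gradient-descent logistic regression over the score vector, Platt's smoothed targets in place of hard 0/1 labels, and a degree-2 expansion of the scores. The helper names (`expand`, `platt_targets`, `fit_calibrator`, `calibrate`, `make_demo_data`) and the toy Gaussian data are hypothetical, chosen only for illustration.

```python
import math
import random

def expand(scores):
    """Degree-2 expansion: original scores plus squares and pairwise products."""
    out = list(scores)
    n = len(scores)
    for i in range(n):
        for j in range(i, n):
            out.append(scores[i] * scores[j])
    return out

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def platt_targets(labels):
    """Platt's smoothed targets, used instead of hard 0/1 labels."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    return [t_pos if y == 1 else t_neg for y in labels]

def fit_calibrator(rows, labels, steps=500, lr=0.1):
    """Fit weights w and bias b by gradient descent on the cross-entropy loss.

    Each row is the vector of scores that several classifiers gave to one
    observation (assumption: scores are treated as ordinary features).
    """
    targets = platt_targets(labels)
    dim = len(rows[0])
    n = len(rows)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, t in zip(rows, targets):
            # Gradient of the cross-entropy w.r.t. the logit is (prediction - target).
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - t
            grad_b += err
            for k in range(dim):
                grad_w[k] += err * x[k]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b

def calibrate(w, b, scores):
    """Map a score vector to a calibrated probability in (0, 1)."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, scores)))

def make_demo_data(n=200, seed=0):
    """Hypothetical toy data: two classifiers whose scores shift with the class."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0
        rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        labels.append(y)
    return rows, labels
```

To try score expansion under the same assumptions, apply `expand` to every row before calling `fit_calibrator` and to every score vector before calling `calibrate`. The expanded features have larger magnitudes, so a smaller learning rate may be needed.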
### Formula presentation
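1. **Calculation of posterior probability**:
\[
p(h)=P(\omega_1 \mid h)=\frac{1}{1 + L(h)^{-1}\,\frac{1-\pi}{\pi}}
\]
where \(L(h)\) is the likelihood ratio of the score \(h\) under the two classes and \(\pi\) is the prior probability of class \(\omega_1\):
\[
L(h) = \frac{f_h(h \mid \omega_1)}{f_h(h \mid \omega_0)}
\]
\[
\pi = P(\omega_1)
\]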
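As a small numerical check of the posterior formula, the sketch below assumes Gaussian class-conditional score densities with equal variance. That choice, the default parameters, and the function names are illustrative assumptions, not taken from the paper. It evaluates the posterior through the likelihood ratio and compares it with a direct application of Bayes' rule.

```python
import math

def gaussian_pdf(h, mean, std):
    """Density of a normal distribution evaluated at score h."""
    z = (h - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(h, prior, mean0=0.0, mean1=1.0, std=1.0):
    """P(omega_1 | h) from the likelihood ratio L(h) and the prior pi = P(omega_1)."""
    likelihood_ratio = gaussian_pdf(h, mean1, std) / gaussian_pdf(h, mean0, std)
    return 1.0 / (1.0 + (1.0 / likelihood_ratio) * (1.0 - prior) / prior)

def posterior_direct(h, prior, mean0=0.0, mean1=1.0, std=1.0):
    """The same posterior written directly from Bayes' rule, as a cross-check."""
    numerator = prior * gaussian_pdf(h, mean1, std)
    denominator = numerator + (1.0 - prior) * gaussian_pdf(h, mean0, std)
    return numerator / denominator
```

With equal priors and a score halfway between the two class means, the likelihood ratio is 1 and `posterior(0.5, 0.5)` evaluates to 0.5. Lowering the prior for \(\omega_1\) pulls the posterior down for the same score.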
2. **Mean squared error (MSE) and Brier score**:
\[
\text{MSE}(\hat{p}, p)=E_h\left[(\hat{p}(h)-p(h))^2\right]
\]
\[
\text{Brier}(\hat{p}, p)=E_h\left[(\hat{p}(h)-y)^2\right]
\]
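where \(y \in \{0, 1\}\) is the true label of the observation. The MSE compares the estimate \(\hat{p}(h)\) with the true posterior \(p(h)\), whereas the Brier score compares it with the observed label.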
3. **Root mean squared error (RMSE) and root Brier score (RB)**:
\[
\text{RMSE}(\hat{p}, p)=\sqrt{\text{MSE}(\hat{p}, p)}
\]
\[
\text{RB}(\hat{p}, p)=\sqrt{\text{Brier}(\hat{p}, p)}
\]
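The four metrics can be estimated from a finite sample by replacing each expectation with a sample mean. The sketch below makes that assumption explicit. The function names are hypothetical, and the true posteriors passed to `mse` are generally available only when the data-generating distribution is known, as in a simulation.

```python
import math

def mse(p_hat, p_true):
    """Sample estimate of MSE: mean squared gap between estimated and true posteriors."""
    return sum((a - b) ** 2 for a, b in zip(p_hat, p_true)) / len(p_hat)

def brier(p_hat, labels):
    """Sample estimate of the Brier score: mean squared gap between estimates and 0/1 labels."""
    return sum((a - y) ** 2 for a, y in zip(p_hat, labels)) / len(p_hat)

def rmse(p_hat, p_true):
    """Root mean squared error."""
    return math.sqrt(mse(p_hat, p_true))

def root_brier(p_hat, labels):
    """Root Brier score (RB)."""
    return math.sqrt(brier(p_hat, labels))
```

For example, with estimates `[0.9, 0.2, 0.6, 0.4]`, true posteriors `[0.8, 0.1, 0.5, 0.5]`, and labels `[1, 0, 1, 0]`, every estimate is off from the true posterior by 0.1, so the RMSE is 0.1, while the Brier score averages the squared distances to the labels.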
### Conclusion
The paper supports multi-score calibration with a large-scale simulation study and two real cybersecurity datasets, and it recommends Platt's calibrator as a stable choice in practice, while noting that no calibrator wins across all configurations. Although the real datasets come from cybersecurity, the findings concern classifier calibration in general and therefore apply to machine learning more broadly.