Abstract: The receiver operating characteristic (ROC) curve is frequently used as a measure of accuracy of continuous markers in diagnostic tests. The area under the ROC curve (AUC) is arguably the most widely used summary index for the ROC curve. Although the small sample size scenario is common in medical tests, a comprehensive study of the small-sample properties of various methods for constructing the confidence/credible interval (CI) for the AUC has been largely missing from the literature. In this paper, we describe and compare 29 non-parametric and parametric methods for constructing the CI for the AUC when the number of available observations is small. The methods considered include not only those that have been widely adopted, but also those that have been less frequently mentioned or, to our knowledge, never applied in the AUC context. To compare the methods, we carried out a simulation study with data generated from binormal models with equal and unequal variances and from exponential models with various parameters, using both equal and unequal small sample sizes. We found that the larger the true AUC value and the smaller the sample size, the larger the discrepancy among the results of different approaches. When the model is correctly specified, the parametric approaches tend to outperform the non-parametric ones. Moreover, in the non-parametric domain, we found that a method based on the Mann–Whitney statistic is in general superior to the others. We further elucidate potential issues and provide possible solutions, along with general guidance on CI construction for the AUC when the sample size is small. Finally, we illustrate the utility of different methods through real-life examples.
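As a rough illustration of what a Mann–Whitney-based interval looks like in practice, the sketch below computes the empirical AUC as the Mann–Whitney statistic divided by the number of diseased/healthy pairs, then forms a Wald-type interval using the Hanley–McNeil standard error. This is one common construction chosen here for illustration; the abstract does not say which Mann–Whitney-based method performed best, so it should not be read as the authors' recommended procedure. The function names and the toy marker values are hypothetical.

```python
import math


def mann_whitney_auc(pos, neg):
    """Empirical AUC: share of (diseased, healthy) pairs in which the
    diseased subject has the larger marker value; ties count one half."""
    wins = 0.0
    for x in pos:
        for y in neg:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hanley_mcneil_ci(pos, neg, z=1.959964):
    """Wald-type interval for the AUC using the Hanley-McNeil variance.

    Assumption: z defaults to the 97.5% standard normal quantile,
    giving a nominal 95% interval. Bounds are truncated to [0, 1].
    """
    a = mann_whitney_auc(pos, neg)
    m, n = len(pos), len(neg)
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    var = (a * (1.0 - a)
           + (m - 1) * (q1 - a * a)
           + (n - 1) * (q2 - a * a)) / (m * n)
    se = math.sqrt(max(var, 0.0))
    lower = max(0.0, a - z * se)
    upper = min(1.0, a + z * se)
    return a, lower, upper


# Hypothetical toy data: five diseased and five healthy marker values.
diseased = [3.1, 2.4, 4.0, 2.9, 3.6]
healthy = [1.2, 2.5, 1.9, 3.0, 2.2]

auc, lower, upper = hanley_mcneil_ci(diseased, healthy)
print(f"AUC = {auc:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With five observations per group and an AUC near 0.9, the untruncated upper bound exceeds 1 and has to be clipped. This is a typical symptom of Wald-type intervals in the small-sample, high-AUC regime that the abstract singles out as the one where methods disagree most.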

The curious case of the test set AUROC

A Closer Look at AUROC and AUPRC under Class Imbalance

Area under the ROC Curve has the Most Consistent Evaluation for Binary Classification

A comparison of confidence/credible interval methods for the area under the ROC curve for continuous diagnostic tests with small sample size

Nonparametric receiver operating characteristic curve analysis with an imperfect gold standard

OpenAUC: Towards AUC-Oriented Open-Set Recognition

Comparing multi-class classifier performance by multi-class ROC analysis: A nonparametric approach

Schroedinger's Threshold: When the AUC doesn't predict Accuracy

Does it pay to optimize AUC?

A Modified AUC for Training Convolutional Neural Networks: Taking Confidence Into Account

Interval Estimation for the Difference in Paired Areas under the ROC Curves in the Absence of a Gold Standard Test

A Non-Parametric Method for the Comparison of Partial Areas under ROC Curves and Its Application to Large Health Care Data Sets.

Interpretation of the Area Under the ROC Curve for Risk Prediction Models

On Fixing the Right Problems in Predictive Analytics: AUC Is Not the Problem

Empirical Comparison of Area under ROC curve (AUC) and Mathew Correlation Coefficient (MCC) for Evaluating Machine Learning Algorithms on Imbalanced Datasets for Binary Classification

The receiver operating characteristic curve accurately assesses imbalanced datasets

Reducing the overfitting in the gROC curve estimation

An efficient variance estimator of AUC and its applications to binary classification

Small-sample precision of ROC-related estimates

Overcoming Common Flaws in the Evaluation of Selective Classification Systems

AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity