Abstract: The precision-recall curve (PRC) and the area under the precision-recall curve (AUPRC) are useful for quantifying classification performance. They are commonly used in situations with imbalanced classes, such as cancer diagnosis and cell type annotation. We evaluate 10 popular tools for plotting PRC and computing AUPRC, which were collectively used in more than 3000 published studies. We find the AUPRC values computed by the tools rank classifiers differently and some tools produce overly-optimistic results.

What problem does this paper attempt to address?

The paper attempts to address the issue of inconsistencies and overly optimistic results produced by different software tools when plotting Precision-Recall Curves (PRC) and calculating the Area Under the PRC (AUPRC). Specifically, the authors evaluated 10 commonly used software tools that have been employed in over 3000 published studies. The study found that these tools, due to the different methods they use for classifier performance evaluation, result in varying rankings of AUPRC values, with some tools even producing overly optimistic results.

### Main Issues Include:

1. **Method Differences**: Different tools use different methods to connect anchor points on the PRC, leading to different AUPRC values.
2. **Overly Optimistic Results**: Some tools use linear interpolation when handling cases with the same classifier scores, which can result in overly optimistic AUPRC values.
3. **Conceptual and Implementation Issues**: Some tools have conceptual errors in calculating AUPRC, such as always starting the PRC at (0, 1) or not generating a complete PRC that covers the full recall range from zero to one.
4. **Visualization Issues**: Some tools also have issues in generating PRC visualizations, such as not always starting the curve from zero recall or always starting from (0, 1).

### Research Background:

- **Precision-Recall Curve (PRC)** and **Area Under the PRC (AUPRC)** are important metrics for measuring classifier performance, especially sensitive when dealing with imbalanced datasets.
- **Application Areas**: These metrics are widely used in biological and medical research, such as biological network reconstruction, cancer gene identification, and cell type annotation.

### Research Objectives:

- **Evaluate Common Tools**: Evaluate the performance of 10 commonly used tools in plotting PRC and calculating AUPRC.
- **Reveal Issues**: Reveal inconsistencies and potential issues of these tools in practical applications.
- **Provide Recommendations**: Offer improvement suggestions to avoid reporting overly optimistic AUPRC values and introducing evaluation biases.

Through this study, the authors hope to raise awareness among researchers about these issues and promote the development of more reliable and consistent tools for evaluating classifier performance.
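The interpolation and starting-point issues listed above can be illustrated with a small, self-contained Python sketch. It is not the implementation of any of the ten evaluated tools, and it does not claim to be the method the authors recommend; the function names (`pr_anchor_points`, `auprc_step`, `auprc_linear`) and the toy data are hypothetical. It only assumes two common conventions: a step-wise sum that holds precision constant over each recall increment (often called average precision), and a trapezoidal rule that draws straight lines between anchor points and starts the curve at (0, 1).

```python
from itertools import groupby


def pr_anchor_points(labels, scores):
    """Return (recall, precision) pairs, one per distinct score threshold."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [], 0, 0
    # Tied scores cannot be ordered, so each tie group yields one anchor point.
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        for _, label in group:
            if label:
                tp += 1
            else:
                fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def auprc_step(points):
    """Step-wise area: precision is held constant over each recall increment."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def auprc_linear(points, start=(0.0, 1.0)):
    """Trapezoidal area: straight lines between anchor points, beginning at `start`."""
    area = 0.0
    prev_recall, prev_precision = start
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical toy data: the four lowest-ranked examples share the same score.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.5, 0.5, 0.5, 0.5]
points = pr_anchor_points(labels, scores)
print("step-wise AUPRC:", auprc_step(points))
print("linear AUPRC:   ", auprc_linear(points))

# Extreme case: every score is tied, so the classifier carries no ranking information.
tied_points = pr_anchor_points([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5])
print("all tied, step-wise:", auprc_step(tied_points))
print("all tied, linear:   ", auprc_linear(tied_points))
```

On the toy data the step-wise area is 2/3 while the linear area is 17/24 (about 0.708). In the all-tied case the step-wise area equals the positive rate (0.25), whereas the line drawn from (0, 1) yields 0.625, which shows how the choice of interpolation and starting point alone can inflate a reported AUPRC.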

Commonly used software tools produce conflicting and overly-optimistic AUPRC values
