Commonly used software tools produce conflicting and overly-optimistic AUPRC values

Wenyu Chen, Chen Miao, Zhenghao Zhang, Cathy Sin-Hang Fung, Ran Wang, Yizhen Chen, Yan Qian, Lixin Cheng, Kevin Y. Yip, Stephen Kwok-Wing Tsui, Qin Cao
DOI: https://doi.org/10.1186/s13059-024-03266-y
IF: 17.906
2024-05-14
Genome Biology
Abstract: The precision-recall curve (PRC) and the area under the precision-recall curve (AUPRC) are useful for quantifying classification performance. They are commonly used in situations with imbalanced classes, such as cancer diagnosis and cell type annotation. We evaluate 10 popular tools for plotting PRC and computing AUPRC, which were collectively used in more than 3000 published studies. We find the AUPRC values computed by the tools rank classifiers differently and some tools produce overly-optimistic results.
genetics & heredity, biotechnology & applied microbiology
What problem does this paper attempt to address?
The paper addresses the inconsistent and overly optimistic results that different software tools produce when plotting precision-recall curves (PRC) and calculating the area under the PRC (AUPRC). The authors evaluated 10 commonly used tools that were collectively used in more than 3000 published studies. They found that, because the tools compute AUPRC in different ways, their AUPRC values rank classifiers differently, and some tools produce overly optimistic results.

### Main Issues Include:

1. **Method Differences**: Different tools use different methods to connect anchor points on the PRC, leading to different AUPRC values.
2. **Overly Optimistic Results**: Some tools use linear interpolation when several samples share the same classifier score, which can result in overly optimistic AUPRC values.
3. **Conceptual and Implementation Issues**: Some tools have conceptual errors in calculating AUPRC, such as always starting the PRC at (0, 1) or not generating a complete PRC that covers the full recall range from zero to one.
4. **Visualization Issues**: Some tools also have problems in the PRC plots they generate, such as curves that do not always start at zero recall or that always start at (0, 1).

### Research Background:

- **Precision-Recall Curve (PRC)** and **Area Under the PRC (AUPRC)** are important metrics for measuring classifier performance, and are especially useful when classes are imbalanced.
- **Application Areas**: These metrics are widely used in biological and medical research, such as biological network reconstruction, cancer gene identification, and cell type annotation.

### Research Objectives:

- **Evaluate Common Tools**: Evaluate the performance of 10 commonly used tools in plotting PRC and calculating AUPRC.
- **Reveal Issues**: Reveal inconsistencies and potential issues of these tools in practical applications.
- **Provide Recommendations**: Offer improvement suggestions to avoid reporting overly optimistic AUPRC values and introducing evaluation biases.

Through this study, the authors hope to raise awareness among researchers about these issues and promote the development of more reliable and consistent tools for evaluating classifier performance.
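
The first two issues in the list of main issues can be illustrated with a small, self-contained sketch. It is not taken from the paper or from any of the evaluated tools. The function names (`pr_anchor_points`, `auprc_step`, `auprc_linear`) are hypothetical, and the choice of (0, 1) as the default starting point for the linear variant is an assumption made only to mirror the behaviour described in the summary. The sketch computes the PRC anchor points once, then connects them in two different ways.

```python
from itertools import groupby


def pr_anchor_points(labels, scores):
    """Return (recall, precision) at each distinct score threshold, highest first.

    Illustrative helper (hypothetical name); labels are 0/1 integers.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = []
    # Samples that share a score cross the threshold together,
    # so a group of ties contributes a single anchor point.
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        for _, label in group:
            if label == 1:
                tp += 1
            else:
                fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def auprc_step(points):
    """Step-wise area: each gain in recall is weighted by the precision
    at the anchor point where that gain is reached."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def auprc_linear(points, start=(0.0, 1.0)):
    """Trapezoidal area with straight lines between anchor points.

    Assumption for illustration: the curve is forced to begin at `start`.
    """
    area = 0.0
    prev_recall, prev_precision = start
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# An uninformative classifier: every sample receives the same score.
labels = [1, 0, 0, 0]
scores = [0.5, 0.5, 0.5, 0.5]
points = pr_anchor_points(labels, scores)
print(auprc_step(points))    # 0.25, equal to the fraction of positives
print(auprc_linear(points))  # 0.625
```

With all scores tied there is a single anchor point at recall 1 and precision 0.25. The step-wise sum reports 0.25, while drawing a straight line from (0, 1) to that point reports 0.625 for the same predictions, which is the kind of optimistic gap the summary describes.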