Abstract:
Background: For medical artificial intelligence (AI) training and validation, human expert labels are considered the gold standard that represents the correct answers or desired outputs for a given data set. These labels serve as a reference or benchmark against which the model's predictions are compared.
Objective: This study aimed to assess the accuracy of a custom deep learning (DL) algorithm on classifying diabetic retinopathy (DR) and further demonstrate how label errors may contribute to this assessment in a nationwide DR-screening program.
Methods: Fundus photographs from the Lifeline Express, a nationwide DR-screening program, were analyzed to identify the presence of referable DR using both (1) manual grading by National Health Service England-certificated graders and (2) a DL-based DR-screening algorithm with validated good lab performance. To assess the accuracy of labels, a random sample of images with disagreement between the DL algorithm and the labels was adjudicated by ophthalmologists who were masked to the previous grading results. The error rates of labels in this sample were then used to correct the number of negative and positive cases in the entire data set, serving as postcorrection labels. The DL algorithm's performance was evaluated against both pre- and postcorrection labels.
Results: The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between the real-world performance and the lab-reported performance in this nationwide data set, with a sensitivity increase of 12.5% (from 79.6% to 92.5%, P<.001) and a specificity increase of 6.9% (from 91.6% to 98.5%, P<.001). In the random sample, 63.6% (560/880) of negative images and 5.2% (140/2710) of positive images were misclassified in the precorrection human labels. High myopia was the primary reason for misclassifying non-DR images as referable DR images, while laser spots were predominantly responsible for misclassified referable cases. The estimated label error rate for the entire data set was 1.2%. The label correction was estimated to bring about a 12.5% enhancement in the estimated sensitivity of the DL algorithm (P<.001).
Conclusions: Label errors based on human image grading, although in a small percentage, can significantly affect the performance evaluation of DL algorithms in real-world DR screening.
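The correction step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the sampled error rates apply uniformly to all images where the algorithm and the human label disagree, and that labels are correct wherever the two agree. The confusion-matrix counts (`tp`, `fp`, `fn`, `tn`) are hypothetical placeholders, and the function names are invented for illustration; only the two sample error rates (560/880 and 140/2710) come from the abstract.

```python
def correct_counts(tp, fp, fn, tn, err_neg, err_pos):
    """Re-estimate confusion-matrix counts after label correction.

    err_neg: share of human-negative labels in the disagreement sample
             that adjudication found to be wrong (image was referable DR).
    err_pos: share of human-positive labels in the disagreement sample
             that adjudication found to be wrong (image was not referable DR).
    Assumption: labels are only wrong where algorithm and label disagree.
    """
    fp_to_tp = fp * err_neg  # algorithm was right, label wrongly negative
    fn_to_tn = fn * err_pos  # algorithm was right, label wrongly positive
    return tp + fp_to_tp, fp - fp_to_tp, fn - fn_to_tn, tn + fn_to_tn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Label error rates reported for the adjudicated random sample.
ERR_NEG = 560 / 880    # misclassified negative images
ERR_POS = 140 / 2710   # misclassified positive images

# Hypothetical precorrection counts, NOT taken from the study.
tp, fp, fn, tn = 800, 900, 200, 9000

pre_sens, pre_spec = sensitivity(tp, fn), specificity(tn, fp)
ctp, cfp, cfn, ctn = correct_counts(tp, fp, fn, tn, ERR_NEG, ERR_POS)
post_sens, post_spec = sensitivity(ctp, cfn), specificity(ctn, cfp)

print(f"sensitivity: {pre_sens:.3f} -> {post_sens:.3f}")
print(f"specificity: {pre_spec:.3f} -> {post_spec:.3f}")
```

With these placeholder counts, both metrics rise after correction because images the algorithm classified correctly move out of the false-positive and false-negative cells. The size of the change depends entirely on the real counts, which the abstract does not report.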

Driving down Poisson error can offset classification error in clinical tasks

Machine Learning for Patient-Based Real-Time Quality Control (PBRTQC), Analytical and Preanalytical Error Detection in Clinical Laboratory

Machine Learning-Based Sample Misidentification Error Detection in Clinical Laboratory Tests: A Retrospective Multicenter Study

Metrics to guide development of machine learning algorithms for malaria diagnosis

Statistical Thinking, Machine Learning

Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study

Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions

Deep learning uncertainty quantification for clinical text classification

A giant with feet of clay: on the validity of the data that feed machine learning in medicine

Mixed-Integer Projections for Automated Data Correction of EMRs Improve Predictions of Sepsis among Hospitalized Patients

Mitigating Diagnostic Errors in Lung Cancer Classification: A Multi-Eyes Principle to Uncertainty Quantification

Severity of error in hierarchical datasets

A machine learning case study to predict rare clinical event of interest: imbalanced data, interpretability, and practical considerations

Efficient automated error detection in medical data using deep-learning and label-clustering

Accuracy of machine learning in predicting outcomes post-percutaneous coronary intervention: a systematic review

The Dependence of Machine Learning on Electronic Medical Record Quality

The impact of inconsistent human annotations on AI driven clinical decision making

Improving the accuracy of medical diagnosis with causal machine learning

The harms of class imbalance corrections for machine learning based prediction models: a simulation study

Evaluating the Impact of Pulse Oximetry Bias in Machine Learning under Counterfactual Thinking

Mistakes in validating the accuracy of a prediction classifier in high-dimensional but small-sample microarray data