Abstract:

INTRODUCTION Subchondral bone marrow lesions (BMLs) are associated with symptoms and structural progression of knee osteoarthritis (OA). Automated detection of BMLs using deep learning may help screen potential clinical trial participants and enrich study samples for fast structural progressors or for a specific pain phenotype. The metric most commonly used to evaluate deep learning binary classifiers, the receiver operating characteristic (ROC) curve, may be less informative than other performance metrics, especially when the data used to train and validate the models are imbalanced, as happens when the outcome of interest is rare.

OBJECTIVE To compare the evaluation of deep learning binary classification of BMLs, based on imbalanced data from the Osteoarthritis Initiative (OAI) study, across several performance metrics.

METHODS We analyzed the sagittal intermediate-weighted (IW) fat-suppressed (FS) MRI data of 2,467 participants from the OAI study. We dichotomized the MRI Osteoarthritis Knee Score (MOAKS) BML grades (scored 0 to 3) into two classes: grades > 0 as presence and grade = 0 as absence. After the deep learning models were trained, we used them to determine BML status from the MR images in each of 13 subregions of the femur and tibia (e.g., Femur Central Medial (FemCentMed), Tibia Anterior Lateral (TibAntLat), Tibia Posterior Medial (TibPostMed)). We used the ROC curve, the precision-recall (PR) curve, the precision-recall gain (PRG) curve, the F1 score, and the Matthews correlation coefficient (MCC) to summarize the prediction performance of the deep learning models on the test data.

RESULTS The available MOAKS data from the OAI are imbalanced. The class imbalance ratios (presence of BMLs vs. absence of BMLs) are 569:2427, 49:2947, and 191:2805 in the FemCentMed, TibAntLat, and TibPostMed subregions, respectively.
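To make the reported imbalance concrete, the short Python sketch below converts the three presence:absence counts quoted above into prevalences. It is an illustration only, not part of the authors' analysis; the names `counts`, `prevalence`, and `summary` are hypothetical. Prevalence matters here because a no-skill classifier has an expected precision equal to the prevalence, whereas its expected ROC-AUC is 0.5 regardless of imbalance.

```python
# Presence:absence counts per subregion, copied from the abstract.
counts = {
    "FemCentMed": (569, 2427),
    "TibAntLat": (49, 2947),
    "TibPostMed": (191, 2805),
}


def prevalence(present, absent):
    """Fraction of cases in the positive (BML present) class."""
    return present / (present + absent)


# For each subregion: (prevalence, number of absent cases per present case).
summary = {
    name: (prevalence(present, absent), absent / present)
    for name, (present, absent) in counts.items()
}

for name, (prev, ratio) in summary.items():
    print(f"{name}: prevalence = {prev:.3f}, absence:presence = {ratio:.1f}:1")
```

Under these counts, TibAntLat has roughly 60 absent cases for every present case, so any PR-based summary for that subregion should be read against a very low no-skill baseline.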
When the data are this severely imbalanced, the area under the ROC curve (ROC-AUC) and the area under the PR curve (PR-AUC) give conflicting pictures of performance in TibAntLat and TibPostMed (see Table 1). In general, a binary classifier with a ROC-AUC of 0.8 to 0.9 is considered excellent, and one with a ROC-AUC above 0.9 is considered outstanding. The ROC metric (ROC-AUC = 0.84) is too optimistic, since the precision and sensitivity are nearly zero, indicating that almost all cases are assigned to the absence-of-BMLs class. The PR curve (PR-AUC = 0.10) is more informative than the ROC curve because it is consistent with the values of precision and sensitivity. The MCC and F1 results are also consistent with those of the PR curve at both high (TibAntLat) and low (FemCentMed) class imbalance ratios.

CONCLUSION The class imbalance ratio, together with the ROC, PR, and MCC results, should be reported for deep learning models of binary classification, particularly when the underlying data are imbalanced. An expanded set of performance metrics is needed to interpret the prediction performance of such models properly.

SPONSOR None.

DISCLOSURE STATEMENT AG is a consultant to Pfizer, Novartis, Regeneron, TissueGene, Merck Serono, and AstraZeneca. AG and FWR are shareholders of BICL, LLC. FWR is a consultant to Calibr (California Institute of Biomedical Research) and Grunenthal. KK is a consultant to Regeneron, LG Chem, and Express Scripts. He is principal investigator for pharma-sponsored clinical trials for AbbVie, Cumberland, and GSK, and serves on the DSMB for Kolon TissueGene and Avalor Therapeutics.
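The disagreement between ROC-AUC and the other metrics described in the abstract can be reproduced on toy data. The Python sketch below is a minimal illustration under stated assumptions: the scores are synthetic and deterministic (not OAI data), the 0.5 decision threshold is an assumption, and all function names are hypothetical. PR-AUC is approximated by step-wise average precision, and undefined precision, F1, and MCC values are set to 0 by convention.

```python
import math

# Synthetic scores (NOT OAI data): 1000 absent cases, 20 present cases.
# Positives rank above most negatives, but every score stays below 0.5.
neg_scores = [0.0004 * i for i in range(1000)]
pos_scores = [0.3001 + 0.005 * j for j in range(20)]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)


def roc_auc_from_scores(pos, neg):
    """Probability that a random positive outranks a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise PR-AUC: mean of the precision at each positive's rank."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def threshold_metrics(scores, labels, threshold=0.5):
    """Precision, recall, F1, and MCC at a fixed decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0 by convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention
    return precision, recall, f1, mcc


roc_auc = roc_auc_from_scores(pos_scores, neg_scores)
avg_prec = average_precision(scores, labels)
precision, recall, f1, mcc = threshold_metrics(scores, labels)

print(f"ROC-AUC = {roc_auc:.2f}, PR-AUC (average precision) = {avg_prec:.2f}")
print(f"precision = {precision}, recall = {recall}, F1 = {f1}, MCC = {mcc}")
```

In this toy setting the classifier ranks positives well, so ROC-AUC is high, yet it never predicts the positive class at the chosen threshold, so precision, recall, F1, and MCC are all zero and the average precision stays low. This is the pattern of disagreement the abstract reports, shown on synthetic scores rather than on the study's models.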

COMPARISON OF VARIOUS METRICS FOR EVALUATING THE PERFORMANCE OF DEEP LEARNING BINARY CLASSIFICATION, PARTICULARLY WHEN UNDERLYING IMAGING DATA ARE IMBALANCED
