Diagnostic performance for severity grading of hip osteoarthritis and osteonecrosis of femoral head on radiographs: Deep learning model vs. board-certified orthopaedic surgeons

Chen Chen, Peng Liu, Yong Feng, DeXian Ye, Chi-Cheng Fu, Lin Ye, YanYan Song, DongXu Liu, Guoyan Zheng, ChangQing Zhang
DOI: https://doi.org/10.1016/j.ostima.2023.100092
2023-01-01
Osteoarthritis Imaging
Abstract: This study evaluated the diagnostic performance of a single deep learning (DL) model for severity grading of two typical yet challenging hip disorders on digital radiographs: primary hip osteoarthritis (PHOA) and osteonecrosis of the femoral head (ONFH). We conducted a two-center, retrospective study. We trained an XceptionNet-based DL model on a dataset of 56,597 hip images. A panel of 10 board-certified orthopedic surgeons had diagnosed each image as one of seven classes: normal, PHOA_I, PHOA_II, PHOA_III, ONFH_II, ONFH_III, or ONFH_IV. The trained model was validated on a separate testing dataset. To assess the model's generalizability, we applied the trained model directly to a dataset of 811 hip images collected at an external clinical center. We evaluated accuracy, the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. On the testing dataset, the model achieved an overall AUC of 94.9%. The per-class AUCs were 94.2% for PHOA_I, 95.8% for PHOA_II, 90.9% for PHOA_III, 93.6% for ONFH_II, 93.8% for ONFH_III, and 93.8% for ONFH_IV. The DL model's average sensitivity across all classes (0.797) was higher than the average sensitivity of the board-certified orthopedic surgeons (0.756). When the model was applied directly to the external dataset, its AUC decreased. These results indicate that a single DL model can be trained to grade the severity of both PHOA and ONFH on digital radiographs. The model may serve as a second opinion for severity grading of hip disorders.
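The abstract reports per-class AUC, sensitivity, and specificity, along with an average sensitivity across classes. The sketch below shows one common way to compute such one-vs-rest metrics from predicted labels and per-class scores. It is not the authors' evaluation code. It assumes that "average sensitivity" means an unweighted (macro) mean over classes, and it computes AUC with the rank-based Mann-Whitney formulation. The function names are hypothetical. The toy labels and scores are invented for illustration and are not study data.

```python
from typing import Sequence, Tuple

# Class names follow the abstract; their ordering here is an assumption.
CLASSES = ["normal", "PHOA_I", "PHOA_II", "PHOA_III", "ONFH_II", "ONFH_III", "ONFH_IV"]


def one_vs_rest_sens_spec(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> Tuple[float, float]:
    """Sensitivity and specificity for `cls`, treating every other class as negative."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == cls:
            if p == cls:
                tp += 1  # true positive: image of cls predicted as cls
            else:
                fn += 1  # false negative: image of cls predicted as another class
        elif p == cls:
            fp += 1  # false positive: other class predicted as cls
        else:
            tn += 1  # true negative: other class not predicted as cls
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def macro_sensitivity(y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str] = CLASSES) -> float:
    """Unweighted mean sensitivity over the classes present in y_true (assumed definition)."""
    seen = set(y_true)
    present = [c for c in classes if c in seen]  # skip classes with no positives (undefined sensitivity)
    return sum(one_vs_rest_sens_spec(y_true, y_pred, c)[0] for c in present) / len(present)


def one_vs_rest_auc(y_true: Sequence[str], scores: Sequence[float], cls: str) -> float:
    """ROC AUC for `cls` vs. rest via the Mann-Whitney statistic.

    `scores[i]` is the model's score for `cls` on image i (e.g. a softmax probability).
    """
    pos = [s for t, s in zip(y_true, scores) if t == cls]
    neg = [s for t, s in zip(y_true, scores) if t != cls]
    if not pos or not neg:
        return float("nan")
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data for illustration only; not from the study.
    y_true = ["normal", "PHOA_I", "PHOA_I", "ONFH_II"]
    y_pred = ["normal", "PHOA_I", "PHOA_II", "ONFH_II"]
    print(one_vs_rest_sens_spec(y_true, y_pred, "PHOA_I"))  # (0.5, 1.0)
    print(f"{macro_sensitivity(y_true, y_pred):.3f}")  # 0.833
    print(one_vs_rest_auc(y_true, [0.4, 0.8, 0.4, 0.2], "PHOA_I"))  # 0.875
```

Averaging only over classes present in the labels avoids undefined sensitivities on small subsets. The abstract does not say how the study averaged across classes or whether its "overall AUC" is a micro or macro average, so those choices here are assumptions.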