Validation of an Established TW3 Artificial Intelligence Bone Age Assessment System: a Prospective, Multicenter, Confirmatory Study
Yanqi Liu,Liujian Ouyang,Wei Wu,Xuelian Zhou,Ke Huang,Zhihua Wang,Cui Song,Qiuli Chen,Zhe Su,Rongxiu Zheng,Ying Wei,Wei Lu,Yang Liu,Ziye Yan,Zhaoyuan Wu,Jitao Fan,Mingzhi Zhou,Junfen Fu
DOI: https://doi.org/10.21037/qims-23-715
2023-01-01
Quantitative Imaging in Medicine and Surgery
Abstract: Background: In 2020, our center established a Tanner-Whitehouse 3 (TW3) artificial intelligence (AI) system using a convolutional neural network (CNN), built upon 9059 radiographs. However, this system, on which the present study is based, lacked a gold standard for comparison and had not been thoroughly evaluated in different working environments. Methods: To further verify the applicability of the AI system in clinical bone age assessment (BAA) and to enhance the accuracy and homogeneity of BAA, a prospective multi-center validation was conducted. The study used 744 left-hand radiographs of patients aged 1 to 20 years (378 boys and 366 girls), obtained from nine children's hospitals between August and December 2020. BAAs were performed using the TW3 AI system and were also reviewed by experienced reviewers. Accuracy was evaluated using bone age accuracy within 1 year, root mean square error (RMSE), and mean absolute error (MAE). The Kappa test and Bland-Altman (B-A) plots were used to measure diagnostic consistency. Results: The system performed well, producing results that closely aligned with those of the reviewers. It achieved an RMSE of 0.52 years and an accuracy of 94.55% for the radius, ulna, and short bones series. For the carpal series, the system achieved an RMSE of 0.85 years and an accuracy of 80.38%. Overall, the system displayed satisfactory accuracy and RMSE, particularly in patients over 7 years old. The system excelled in evaluating the carpal bone age of patients aged 1-6. Both the Kappa test and B-A plots demonstrated substantial consistency between the system and the reviewers, although the model had difficulty consistently distinguishing specific bones, such as the capitate.
Furthermore, the system's performance proved acceptable across genders, age groups, and radiography instruments. Conclusions: In this multi-center validation, the system showed its potential to enhance the efficiency and consistency of healthcare delivery, ultimately resulting in improved patient outcomes and reduced healthcare costs.
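The agreement measures named in the Methods (accuracy within 1 year, RMSE, MAE, Bland-Altman statistics, and Kappa) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the inclusive 1-year tolerance, the use of 1.96 standard deviations for the Bland-Altman limits of agreement, and the choice of unweighted Cohen's kappa on categorical stage ratings are all assumptions made for illustration, and the sample values are invented rather than taken from the study.

```python
import math
from collections import Counter


def agreement_metrics(ai_ages, reviewer_ages, tolerance=1.0):
    """Compare AI bone ages with reviewer bone ages (both in years).

    Assumption: "accuracy within 1 year" counts |difference| <= tolerance.
    """
    if len(ai_ages) != len(reviewer_ages) or not ai_ages:
        raise ValueError("inputs must be non-empty and of equal length")

    diffs = [a - r for a, r in zip(ai_ages, reviewer_ages)]
    n = len(diffs)

    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    accuracy = sum(abs(d) <= tolerance for d in diffs) / n

    # Bland-Altman: mean difference (bias) and limits of agreement,
    # assumed here to be bias +/- 1.96 * sample standard deviation.
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1)) if n > 1 else 0.0
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)

    return {"mae": mae, "rmse": rmse, "accuracy": accuracy,
            "bias": bias, "limits_of_agreement": limits}


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented example values, for illustration only.
ai = [10.0, 12.5, 8.0, 15.0]
reviewer = [10.5, 12.0, 9.5, 15.0]
metrics = agreement_metrics(ai, reviewer)

ai_stages = ["E", "F", "G", "G"]
reviewer_stages = ["E", "F", "G", "H"]
kappa = cohen_kappa(ai_stages, reviewer_stages)
```

In this sketch the continuous measures operate on bone ages in years, while kappa operates on per-bone maturity stage labels; whether the study applied kappa at the stage level or in another way is not stated in the abstract.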