Diagnostic Performance of Convolutional Neural Network-Based Tanner-Whitehouse 3 Bone Age Assessment System.
Xue-Lian Zhou, Er-Gang Wang, Qiang Lin, Guan-Ping Dong, Wei Wu, Ke Huang, Can Lai, Gang Yu, Hai-Chun Zhou, Xiao-Hui Ma, Xuan Jia, Lei Shi, Yong-Sheng Zheng, Lan-Xuan Liu, Da Ha, Hao Ni, Jun Yang, Jun-Fen Fu
DOI: https://doi.org/10.21037/qims.2020.02.20
2020-01-01
Quantitative Imaging in Medicine and Surgery
Abstract: Background: Bone age reflects a child's true growth and development status and therefore plays a critical role in evaluating growth and endocrine disorders. This study established and validated an optimized Tanner-Whitehouse 3 artificial intelligence (TW3-AI) bone age assessment (BAA) system based on a convolutional neural network (CNN). Methods: A data set of 9,059 clinical radiographs of the left hand, acquired between January 2012 and December 2016, was obtained from the picture archiving and communication system (PACS). Of these, 8,005/9,059 (88%) samples served as the training set for model implementation, 804/9,059 (9%) as the validation set for parameter optimization, and the remaining 250/9,059 (3%) were used to verify the accuracy and reliability of the model against 4 experienced endocrinologists and 2 experienced radiologists. The overall variation of the TW3-metacarpophalangeal, radius, ulna and short bones (RUS) and TW3-Carpal bone scores between the reviewers and the AI was compared using Bland-Altman (BA) plots, and the variation for each bone (13 RUS + 7 Carpal) was compared using the Kappa test. The time taken by the model and by the reviewers was also compared. Results: The TW3-AI model was highly consistent with the reviewers' overall estimates, with a root mean square (RMS) of 0.50 years. The TW3-AI model assessed bone age more accurately than the reviewers did. Further analysis revealed that human interpretations were most inconsistent for the capitate, hamate, first distal phalanx, and fifth middle phalanx in males, and for the capitate, trapezoid, and third and fifth middle phalanges in females. The average image processing time of the TW3-AI model was 1.5 ± 0.2 s, which was significantly shorter than manual interpretation. Conclusions: CNN-based TW3 BAA was accurate and time-saving, and it was more stable than assessments made by experienced experts.
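The agreement statistics named in the Methods and Results (RMS, Bland-Altman analysis, and the Kappa test) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes the RMS is taken over paired differences between two sets of bone age estimates, that the Bland-Altman limits of agreement are the conventional mean difference ± 1.96 standard deviations, and that the Kappa test is Cohen's kappa on per-bone stage labels; the abstract specifies none of these details. The function names and the sample values are hypothetical.

```python
import math
from collections import Counter


def rms_difference(a, b):
    """Root mean square of the paired differences a[i] - b[i] (assumed RMS definition)."""
    diffs = [x - y for x, y in zip(a, b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def bland_altman(a, b):
    """Return (mean difference, lower limit, upper limit).

    Limits of agreement use the conventional mean +/- 1.96 * sample SD.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean, mean - 1.96 * sd, mean + 1.96 * sd


def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    n = len(r1)
    observed = sum(1 for x, y in zip(r1, r2) if x == y) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical bone ages in years; not data from the study.
    ai_years = [7.9, 10.2, 12.6, 5.1]
    reviewer_years = [8.3, 10.0, 12.1, 5.6]
    print(rms_difference(ai_years, reviewer_years))
    print(bland_altman(ai_years, reviewer_years))

    # Hypothetical TW3 maturity stage labels for one bone across four radiographs.
    ai_stage = ["E", "F", "G", "D"]
    reviewer_stage = ["E", "F", "F", "D"]
    print(cohen_kappa(ai_stage, reviewer_stage))
```

Continuous bone age estimates suit the RMS and Bland-Altman measures, while the per-bone comparison is categorical, which is why a kappa statistic is the natural choice for it.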