Validation of a Deep Learning Model for Traumatic Brain Injury Detection and NIRIS Grading on Non-Contrast CT: a Multi-Reader Study with Promising Results and Opportunities for Improvement.
Bin Jiang, Burak Berksu Ozkara, Sean Creeden, Guangming Zhu, Victoria Y. Ding, Hui Chen, Bryan Lanzman, Dylan Wolman, Sara Shams, Austin Trinh, Ying Li, Alexander Khalaf, Jonathon J. Parker, Casey H. Halpern, Max Wintermark
DOI: https://doi.org/10.1007/s00234-023-03170-5
2023-01-01
Neuroradiology
Abstract: This study aimed to assess and externally validate the performance of a deep learning (DL) model for the interpretation of non-contrast computed tomography (NCCT) scans of patients with suspected traumatic brain injury (TBI). This retrospective, multi-reader study included patients with suspected TBI who were transported to the emergency department and underwent NCCT scans. Eight reviewers with varying levels of training and experience (two neuroradiology attendings, two neuroradiology fellows, two neuroradiology residents, one neurosurgery attending, and one neurosurgery resident) independently evaluated the NCCT head scans. The same scans were evaluated using version 5.0 of the DL model icobrain tbi. The ground truth was established by consensus among the study reviewers after a thorough assessment of all accessible clinical and laboratory data and follow-up imaging studies, including NCCT and magnetic resonance imaging. The outcomes of interest included neuroimaging radiological interpretation system (NIRIS) scores; the presence of midline shift, mass effect, hemorrhagic lesions, hydrocephalus, and severe hydrocephalus; and measurements of midline shift and volumes of hemorrhagic lesions. Comparisons were made using the weighted Cohen’s kappa coefficient. The McNemar test was used to compare diagnostic performance. Bland–Altman plots were used to compare measurements. One hundred patients were included, and the DL model successfully categorized 77 scans. The median age was 48 for the total group, 44.5 for the omitted group, and 48 for the included group. The DL model demonstrated moderate agreement with the ground truth, trainees, and attendings. With the DL model’s assistance, trainees’ agreement with the ground truth improved. The DL model showed high specificity (0.88) and positive predictive value (0.96) in classifying NIRIS scores as 0–2 or 3–4. 
Trainees and attendings had the highest accuracy (0.95). The DL model’s performance in classifying various TBI CT imaging common data elements was comparable to that of trainees and attendings. The average difference for the DL model in quantifying the volume of hemorrhagic lesions was 6.0 mL with a wide 95
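The three comparison methods named in the abstract (weighted Cohen’s kappa, the McNemar test, and Bland–Altman analysis) can be illustrated with a minimal sketch. This is not the study's analysis code: the linear kappa weights, the uncorrected chi-square form of the McNemar test, the function names, and every number in the example are assumptions chosen for illustration only.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def weighted_kappa(a, b, n_levels=5):
    """Linearly weighted Cohen's kappa for two raters scoring 0..n_levels-1.

    Assumption: linear weights; the abstract does not say which weighting was used.
    """
    n = len(a)
    # Observed joint proportions of (rater A score, rater B score).
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_levels)]
    col = [sum(obs[i][j] for i in range(n_levels)) for j in range(n_levels)]
    # Weighted observed and chance-expected disagreement.
    d_obs = d_exp = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = abs(i - j) / (n_levels - 1)
            d_obs += w * obs[i][j]
            d_exp += w * row[i] * col[j]
    return 1.0 - d_obs / d_exp


def mcnemar(only_first_correct, only_second_correct):
    """McNemar chi-square statistic (no continuity correction) and p-value, 1 df."""
    b, c = only_first_correct, only_second_correct
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    return stat, erfc(sqrt(stat / 2.0))


def bland_altman(x, y):
    """Mean difference (bias) and 95% limits of agreement for paired measurements."""
    diffs = [xi - yi for xi, yi in zip(x, y)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


if __name__ == "__main__":
    # Synthetic values, not data from the study.
    model_scores = [0, 1, 2, 3, 4, 2, 1, 0]
    truth_scores = [0, 1, 3, 3, 4, 1, 1, 0]
    print(weighted_kappa(model_scores, truth_scores))
    print(mcnemar(8, 2))
    print(bland_altman([10.0, 12.0, 9.0, 20.0], [8.0, 11.0, 10.0, 15.0]))
```

In the Bland–Altman function, a wide interval between the lower and upper limits of agreement indicates that individual paired measurements can differ substantially even when the mean difference is small.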