Deep learning for contour quality assurance for RTOG 0933: In-silico evaluation

Evan M Porter, Charles Vu, Ina M Sala, Thomas Guerrero, Zaid A Siddiqui
DOI: https://doi.org/10.1016/j.radonc.2024.110519
2024-08-31
Abstract: Purpose: To validate a CT-based deep learning (DL) hippocampal segmentation model trained on a single-institutional dataset and explore its utility for multi-institutional contour quality assurance (QA). Methods: A DL model was trained to contour hippocampi from a dataset generated by an institutional observer (IO) contouring on brain MRIs from a single-institution cohort. The model was then evaluated on the RTOG 0933 dataset by comparing the treating physician (TP) contours to blinded IO and DL contours using Dice and Hausdorff distance (HD) agreement metrics, and by evaluating differences in dose to the hippocampi when TP vs. IO vs. DL contours are used for planning. The specificity and sensitivity of the DL model in capturing planning discrepancies were quantified using criteria of HD > 7 mm and hippocampal Dmax > 17 Gy. Results: The DL model showed greater agreement with IO contours than with TP contours (DL:IO L/R Dice 74 %/73 %, HD 4.86/4.74 mm; DL:TP L/R Dice 62 %/65 %, HD 7.23/6.94 mm, all p < 0.001). Thirty percent of contours and 53 % of dose plans failed QA. The DL model achieved an AUC of 0.80/0.79 (L/R) on the contour QA task via Hausdorff comparison and an AUC of 0.91 via Dmax comparison. The false negative rate was 17.2 %/20.5 % (L/R contours) and 5.8 % (dose). False negative cases tended to demonstrate higher DL:IO Dice agreement (L/R p = 0.42/0.03) and better qualitative visual agreement than true positive cases. Conclusion: Our study demonstrates the feasibility of using a single-institutional DL model to perform contour QA on a multi-institutional trial for the task of hippocampal segmentation.
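The agreement metrics and QA criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the representation of contours as sets of voxel indices, the brute-force distance search, and the way flags are paired against a reference are all assumptions made for illustration. Only the thresholds (HD > 7 mm, Dmax > 17 Gy) come from the abstract.

```python
from math import dist


def dice(a, b):
    """Dice coefficient between two contours given as sets of voxel indices."""
    if not a and not b:
        return 1.0  # two empty contours are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff_mm(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance in mm between two non-empty voxel sets.

    `spacing` is the voxel size in mm per axis (hypothetical parameter).
    Brute force, so only suitable for small structures.
    """
    pa = [tuple(i * s for i, s in zip(p, spacing)) for p in a]
    pb = [tuple(i * s for i, s in zip(p, spacing)) for p in b]

    def directed(x, y):
        # Largest distance from any point in x to its nearest point in y.
        return max(min(dist(p, q) for q in y) for p in x)

    return max(directed(pa, pb), directed(pb, pa))


def fails_contour_qa(hd_mm, threshold_mm=7.0):
    """Contour criterion from the abstract: HD > 7 mm."""
    return hd_mm > threshold_mm


def fails_dose_qa(dmax_gy, threshold_gy=17.0):
    """Dose criterion from the abstract: hippocampal Dmax > 17 Gy."""
    return dmax_gy > threshold_gy


def sensitivity_specificity(flagged, reference):
    """Sensitivity and specificity of model flags against reference labels.

    Assumes at least one positive and one negative reference case.
    """
    pairs = list(zip(flagged, reference))
    tp = sum(f and r for f, r in pairs)
    fn = sum((not f) and r for f, r in pairs)
    tn = sum((not f) and (not r) for f, r in pairs)
    fp = sum(f and (not r) for f, r in pairs)
    return tp / (tp + fn), tn / (tn + fp)
```

In such a setup, a case would count as a false negative when the reference comparison fails a criterion but the model-based comparison does not flag it; the false negative rate is then one minus the sensitivity.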