Quantitative and Qualitative Evaluation of a Deep Learning Auto Contouring Model for Prostate Cancer Patients with Hydrogel Spacer

S. Zieminski, J. A. Efstathiou, A. L. Zietman, S. C. Kamran, Y. Wang
DOI: https://doi.org/10.1016/j.ijrobp.2020.07.711
2020-01-01
Abstract: This study quantitatively and qualitatively evaluates a deep learning auto contouring model for prostate radiotherapy patients who had a hydrogel spacer (approximately water-equivalent, with no contrast) inserted between the prostate and rectum before treatment. The model employs convolutional neural networks (CNNs) to learn features from input images that can be used to generate semantic segmentation. The study used 163 patients from three specialized GU radiation oncologists (referred to as A/B/C). The first 135 patients (A/B/C = 82/39/14) were split into training (125) and validation (10) sets, with the validation patients selected at random. The validated model was tested on 28 patients (A/B/C = 18/6/4) accrued during model development. Clinical practice did not change during the study period. Simulation CT and MR images were acquired on the same day for each patient. In manual contouring, with MR fused to CT, the spacer was contoured on T2 MR, the prostate on CT with MR guidance, and all other structures on CT only. The model was trained to auto contour the prostate, proximal seminal vesicles (SV), bladder, rectum, penile bulb, femurs and spacer on CT without MR. Quantitatively, auto contours were evaluated against manual contours using the following metrics: sensitivity (% of voxels correctly drawn), false positive rate (FPR, % of voxels overdrawn), Dice similarity coefficient (DSC), 95th-percentile Hausdorff distance (95%-HD) and mean distance (dmean) between the two contours over all slices. The structures with high DSC were qualitatively evaluated by the original attending physician on a three-point scale: 1 (acceptable with minor editing), 2 (editable with efficiency gain over manual contouring) and 3 (rejected for no efficiency gain or gross error). A gross error on the rectum occurred for two patients (A/B = 1/1). These two cases were excluded from the quantitative analysis but counted as rejected in the qualitative evaluation.
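As an illustration of the metrics named above, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that the abstract does not settle: contours are represented as sets of voxel indices, overdrawn voxels are normalized by the manual contour volume, the symmetric 95th-percentile Hausdorff distance is taken as the larger of the two directed 95th percentiles (nearest-rank method), and distances are in voxel units rather than mm. All function names are hypothetical.

```python
import math


def overlap_metrics(auto, manual):
    """Overlap metrics for two contours given as sets of voxel indices."""
    tp = len(auto & manual)   # voxels present in both contours
    fp = len(auto - manual)   # voxels drawn only by the model
    sensitivity = tp / len(manual)
    # Assumption: overdrawn voxels are normalized by the manual volume;
    # the abstract does not state the denominator.
    overdrawn = fp / len(manual)
    dice = 2 * tp / (len(auto) + len(manual))
    return sensitivity, overdrawn, dice


def directed_distances(src, dst):
    """Sorted distances from each point in src to its nearest point in dst."""
    return sorted(min(math.dist(p, q) for q in dst) for p in src)


def percentile_95(sorted_values):
    """Nearest-rank 95th percentile of an ascending list (an assumption)."""
    rank = max(1, math.ceil(0.95 * len(sorted_values)))
    return sorted_values[rank - 1]


def distance_metrics(auto_pts, manual_pts):
    """Symmetric 95th-percentile Hausdorff distance and mean distance."""
    a_to_m = directed_distances(auto_pts, manual_pts)
    m_to_a = directed_distances(manual_pts, auto_pts)
    hd95 = max(percentile_95(a_to_m), percentile_95(m_to_a))
    dmean = (sum(a_to_m) / len(a_to_m) + sum(m_to_a) / len(m_to_a)) / 2
    return hd95, dmean


# Toy example: a 4x4 manual square and an auto contour shifted by one voxel.
manual = {(x, y) for x in range(4) for y in range(4)}
auto = {(x + 1, y) for x, y in manual}
sens, over, dice = overlap_metrics(auto, manual)
hd95, dmean = distance_metrics(auto, manual)
print(sens, over, dice, hd95, dmean)
```

In a real evaluation the distance inputs would be contour or surface points scaled by the voxel spacing, and the brute-force nearest-point search would likely be replaced by a distance transform or spatial index.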
On average, DSC was high for femurs (>0.95) and bladder (0.91), moderate for prostate (0.85) and rectum (0.81), but low for bulb (0.67), proximal SV (0.62) and spacer (0.52). For right femur/left femur/bladder/prostate/rectum, sensitivity = 0.93/0.92/0.88/0.86/0.81, FPR = 1.8%/1.5%/4.5%/15%/17%, 95%-HD = 2.8/2.6/12.1/7.4/9.5 mm, and dmean = 0.9/1.0/2.6/2.5/2.4 mm. Qualitatively, femurs scored 1 in all cases. The average scores for bladder/prostate/rectum were 1.28/1.44/1.50 for physician A, 1.83/2.17/1.67 for physician B and 1.25/1.50/1.25 for physician C, giving 1.39/1.61/1.50 overall. Prostate and rectum both scored well below 2 overall, despite their lower quantitative performance, because some errors caused by the inaccurate prediction of the spacer without MR were deemed easily correctable by the physicians. The model produced clinically satisfactory results, both quantitatively and qualitatively, for femurs, bladder, prostate and rectum. The results for proximal SV and bulb were less satisfactory. The model drew the spacer in the correct location but could not draw it accurately because of the lack of contrast on CT.
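The overall qualitative scores can be cross-checked against the per-physician averages. The sketch below assumes that each overall score is the mean over all 28 test patients, so it is reconstructed as a patient-count-weighted mean using the test-set split A/B/C = 18/6/4. Because the per-physician averages are themselves rounded, the reconstruction is only approximate.

```python
# Test-set patient counts per physician (A/B/C = 18/6/4 in the abstract).
counts = {"A": 18, "B": 6, "C": 4}

# Reported per-physician average scores, ordered bladder, prostate, rectum.
scores = {
    "A": (1.28, 1.44, 1.50),
    "B": (1.83, 2.17, 1.67),
    "C": (1.25, 1.50, 1.25),
}

structures = ("bladder", "prostate", "rectum")
total = sum(counts.values())

# Assumption: the overall score is the patient-count-weighted mean.
overall = {
    name: sum(counts[p] * scores[p][i] for p in counts) / total
    for i, name in enumerate(structures)
}

for name, value in overall.items():
    print(f"{name}: {value:.3f}")
```

Under this assumption the reconstructed values agree with the reported overall scores of 1.39/1.61/1.50 to within rounding.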