Comparative Clinical Evaluation of Deep-Learning-Based Algorithms in Auto-Segmentation of Organs-At-Risk for Head and Neck Cancers
A. Liu, R. Li, C. Han, D. Du, S. Sampath, A. Amini, S. M. Glaser, J. Y. C. Wong
DOI: https://doi.org/10.1016/j.ijrobp.2020.07.324
2020-01-01
Abstract: This IRB-approved study evaluated the quality of contours auto-generated by two deep learning (DL) contouring algorithms for organs-at-risk (OAR) volumes in head and neck cancers. Eleven consecutive head and neck (H&N) patients treated with Tomotherapy were selected for evaluation. Dose prescriptions ranged from 60 to 70 Gy in 30 to 35 fractions. Each patient had three sets of OAR volumes: one drawn by humans (physician and dosimetrist) and used clinically, and two auto-generated by DL contouring solutions based on convolutional neural networks trained on large external datasets. The two DL models compared were an H&N model (DLCExpert, Mirada Medical, Oxford, UK) and a Ua-Net model (DeepVoxel Inc, Irvine, CA). Using the human-generated volumes as the ground truth, we evaluated the two models with three spatial-overlap metrics (Dice coefficient, Jaccard index (JAC), and true positive rate (TPR), i.e., sensitivity), two surface-distance metrics (95% Hausdorff distance (HD) and average distance (AD)), and one volume metric (volume similarity index (VS)). Seventeen common OAR structures were evaluated, including brachial plexus, brainstem, esophagus, eyes, larynx, lenses, mandible, optic nerves and chiasm, oral cavity, parotids, pharyngeal constrictors (PC), submandibular glands (SMGs), spinal cord, and trachea. Both DL models offered a feasible solution for delineating structures on CT images. The Mirada model had only 10 of these organs in common for comparison. As shown in Table 1, the two models produced comparable results, with the DeepVoxel model matching the human contours more closely for most OARs. The different image segmentation metrics gave consistent results.
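To make the overlap and volume metrics named above concrete, the following is a minimal sketch of how they are commonly computed for one structure. It is not the authors' implementation: the function name `overlap_metrics`, the representation of a contour as a set of voxel index tuples, and the formula used for the volume similarity index (one minus the normalized absolute volume difference) are assumptions, since the abstract does not give definitions.

```python
def overlap_metrics(pred, truth):
    """Overlap and volume metrics for one structure.

    pred, truth: iterables of (z, y, x) voxel indices inside the
    auto-generated and the human-generated contour (hypothetical input format).
    """
    pred, truth = set(pred), set(truth)
    if not pred or not truth:
        raise ValueError("both contours must contain at least one voxel")

    intersection = len(pred & truth)
    union = len(pred | truth)
    total = len(pred) + len(truth)

    return {
        # Dice coefficient: 2|P ∩ T| / (|P| + |T|)
        "dice": 2 * intersection / total,
        # Jaccard index: |P ∩ T| / |P ∪ T|
        "jaccard": intersection / union,
        # True positive rate (sensitivity): share of ground-truth voxels recovered
        "tpr": intersection / len(truth),
        # Volume similarity (assumed definition): 1 - ||P| - |T|| / (|P| + |T|)
        "volume_similarity": 1 - abs(len(pred) - len(truth)) / total,
    }


# Toy example: a 4 x 4 square and the same square shifted by one voxel.
truth = {(0, y, x) for y in range(4) for x in range(4)}
pred = {(0, y, x + 1) for y in range(4) for x in range(4)}
print(overlap_metrics(pred, truth))
# {'dice': 0.75, 'jaccard': 0.6, 'tpr': 0.75, 'volume_similarity': 1.0}
```

In the toy example the two contours have equal volume, so the volume similarity is 1.0 even though the overlap is imperfect; this is why a volume metric is usually read alongside overlap metrics rather than on its own. For a single structure, Dice (D) and Jaccard (J) are linked by J = D / (2 - D), so the two always rank contours in the same order.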
DL contours were most similar to human-generated contours for brainstem, esophagus, eyes, larynx, lenses, mandible, parotids, SMGs, spinal cord, and trachea, for which the DeepVoxel model's Dice, JAC, TPR, HD, AD, and VS were 0.80 (range 0.68-0.91), 0.68 (0.53-0.84), 0.78 (0.67-0.94), 5.1 (2.1-11.4), 1.9 mm (1.0-3.9), and 0.88 (0.76-0.97), respectively. Brachial plexus, optic nerves and chiasm, oral cavity, and PC still needed improvement, partly because of differences in organ definition. For example, teeth were included in DeepVoxel's oral cavity but not in the Mirada or human-generated contours. These discrepancies will be corrected in our next DL model. DL auto-generated contours from two different models showed high similarity to human-generated ones for a variety of OARs in the head and neck, with potential for adoption in routine clinical practice. In contrast with atlas-based or active shape model approaches, DL models are capable of producing contours with a high level of clinical acceptance and show promise of being indistinguishable from human-generated ones.
Table 1: Quality of DL-generated contours evaluated by various image segmentation metrics using human-generated contours as the ground truth. All metrics showed consistent results across the organs. (Truncated)
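The two surface-distance metrics reported above (95% Hausdorff distance and average distance) can be sketched in the same spirit. Several conventions exist and the abstract does not say which one was used, so the choices here are assumptions: surface voxels are those with at least one face neighbour outside the contour, HD is the larger of the two directed 95th-percentile distances, and AD is the mean of the two directed mean distances. All function names and the `spacing` parameter (voxel size in mm along z, y, x) are hypothetical, and the brute-force search is only practical for small structures.

```python
import math

# Offsets of the six face neighbours of a voxel.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def surface(mask):
    """Voxels of the mask that have at least one face neighbour outside it."""
    return {
        v for v in mask
        if any((v[0] + dz, v[1] + dy, v[2] + dx) not in mask
               for dz, dy, dx in NEIGHBOURS)
    }


def directed_distances(src, dst, spacing):
    """Distance (mm) from every src surface voxel to its nearest dst voxel."""
    def to_mm(v):
        return tuple(i * s for i, s in zip(v, spacing))

    dst_mm = [to_mm(v) for v in dst]
    return [min(math.dist(to_mm(p), q) for q in dst_mm) for p in src]


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def surface_distance_metrics(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """95% Hausdorff distance and average distance between two contours."""
    surf_pred, surf_truth = surface(set(pred)), surface(set(truth))
    if not surf_pred or not surf_truth:
        raise ValueError("both contours must contain at least one voxel")

    d_pred_to_truth = directed_distances(surf_pred, surf_truth, spacing)
    d_truth_to_pred = directed_distances(surf_truth, surf_pred, spacing)

    mean_pt = sum(d_pred_to_truth) / len(d_pred_to_truth)
    mean_tp = sum(d_truth_to_pred) / len(d_truth_to_pred)
    return {
        "hd95": max(percentile(d_pred_to_truth, 95),
                    percentile(d_truth_to_pred, 95)),
        "average_distance": (mean_pt + mean_tp) / 2,
    }


# Toy example: a 3 x 3 x 3 cube and the same cube shifted by two voxels in x.
truth = {(z, y, x) for z in range(3) for y in range(3) for x in range(3)}
pred = {(z, y, x + 2) for z in range(3) for y in range(3) for x in range(3)}
print(surface_distance_metrics(pred, truth))
# {'hd95': 2.0, 'average_distance': 1.0}
```

Unlike the overlap metrics, these values depend on the voxel spacing and are expressed in physical units, which is why a small structure such as a lens can score a modest Dice while still lying within a millimetre or two of the human contour.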