Comprehensive Clinical Usability-oriented Contour Quality Evaluation for Deep learning Auto-segmentation: Combining Multiple Quantitative Metrics through Machine Learning

Ying Zhang, Asma Amjad, Jie Ding, Christina Sarosiek, Mohammad Zarenia, Renae Conlin, William A Hall, Beth Erickson, Eric Paulson
DOI: https://doi.org/10.1016/j.prro.2024.07.007
2024-09-02
Abstract: Purpose: The metrics commonly used to evaluate the quality of auto-segmented contours have limitations and do not always reflect the clinical usefulness of the contours. This work aims to develop a novel contour quality classification (CQC) method that combines multiple quantitative metrics for clinical usability-oriented contour quality evaluation in deep learning-based auto-segmentation (DLAS). Methods: The CQC was designed to categorize contours on slices as acceptable, minor edit, or major edit, based on the expected editing effort/time, with supervised ensemble tree classification models using seven quantitative metrics. Organ-specific models were trained for five abdominal organs (pancreas, duodenum, stomach, small bowel, and large bowel) using 50 MRI datasets. Twenty additional MRI datasets and nine CT datasets were used for testing. Inter-observer variation (IOV) was assessed among six observers, and consensus labels were established through majority vote for evaluation. The CQC was also compared with a threshold-based baseline approach. Results: For the five organs, the average AUC was 0.982±0.01 and 0.979±0.01, the mean accuracy was 95.8±1.7% and 94.3±2.1%, and the mean risk rate was 0.8±0.4% and 0.7±0.5% for the MRI and CT testing datasets, respectively. The CQC results closely matched the IOV results (mean accuracy of 94.2±0.8% and 94.8±1.7%) and were significantly higher than those obtained using the threshold-based method (mean accuracy of 80.0±4.7%, 83.8±5.2%, and 77.3±6.6% using one, two, and three metrics, respectively). Conclusion: The CQC models demonstrated high performance in classifying the quality of contour slices. This method can address the limitations of existing metrics and offers an intuitive and comprehensive solution for clinically oriented evaluation and comparison of DLAS systems.
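The consensus-labeling step described in the Methods can be illustrated with a minimal sketch. It assumes each of six observers assigns one of the three categories to a slice, and that accuracy is the fraction of slices where a prediction matches the consensus. The abstract does not say how tied votes were resolved, so the tie-break rule here (prefer the label implying more editing effort) is an assumption, as are the function names and the example votes.

```python
from collections import Counter

# Categories named in the abstract, ordered from least to most editing effort.
LABELS = ["acceptable", "minor edit", "major edit"]


def consensus_label(votes):
    """Return the majority-vote label for one contour slice.

    Assumption (not from the paper): a tie goes to the label that implies
    more editing effort, so a split vote errs on the cautious side.
    """
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label in LABELS if counts.get(label, 0) == top]
    return tied[-1]


def accuracy(predicted, reference):
    """Fraction of slices where the predicted label equals the reference."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical votes from six observers on three slices (illustrative only).
slice_votes = [
    ["acceptable"] * 5 + ["minor edit"],
    ["minor edit"] * 3 + ["major edit"] * 3,  # tied vote
    ["major edit"] * 4 + ["minor edit"] * 2,
]
consensus = [consensus_label(v) for v in slice_votes]

# Made-up classifier output, compared against the consensus labels.
model_output = ["acceptable", "minor edit", "major edit"]
print(consensus)
print(accuracy(model_output, consensus))
```

The same `accuracy` helper could score an individual observer against the consensus, which is one way to put classifier results and inter-observer variation on a common footing.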