Comprehensive framework for evaluation of deep neural networks in detection and quantification of lymphoma from PET/CT images: clinical insights, pitfalls, and observer agreement analyses

Shadab Ahamed, Yixi Xu, Sara Kurkowska, Claire Gowdy, Joo H. O, Ingrid Bloise, Don Wilson, Patrick Martineau, François Bénard, Fereshteh Yousefirizi, Rahul Dodhia, Juan M. Lavista, William B. Weeks, Carlos F. Uribe, Arman Rahmim
2024-12-02
Abstract: This study addresses critical gaps in automated lymphoma segmentation from PET/CT images, focusing on issues often overlooked in existing literature. While deep learning has been applied for lymphoma lesion segmentation, few studies incorporate out-of-distribution testing, raising concerns about model generalizability across diverse imaging conditions and patient populations. We highlight the need to compare model performance with expert human annotators, including intra- and inter-observer variability, to understand task difficulty better. Most approaches focus on overall segmentation accuracy but overlook lesion-specific metrics important for precise lesion detection and disease quantification. To address these gaps, we propose a clinically-relevant framework for evaluating deep neural networks. Using this lesion-specific evaluation, we assess the performance of four deep segmentation networks (ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from multi-institutional datasets, covering various lymphoma subtypes and lesion characteristics. Beyond standard metrics like the Dice similarity coefficient (DSC), we evaluate clinical lesion measures and their prediction errors. We also introduce detection criteria for lesion localization and propose a new detection Criterion 3 based on metabolic characteristics. We show that networks perform better on large, intense lesions with higher metabolic activity. Finally, we compare network performance to expert human observers via intra- and inter-observer variability analyses, demonstrating that network errors closely resemble those made by experts. Some small, faint lesions remain challenging for both humans and networks.
This study aims to improve automated lesion segmentation's clinical relevance, supporting better treatment decisions for lymphoma patients. The code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
### What problems does this paper attempt to solve?

This paper addresses several key problems in the automatic segmentation and quantification of lymphoma lesions in PET/CT images:

1. **Insufficient model generalization ability**:
   - **External validation and out-of-distribution testing**: Most existing studies lack testing on external or out-of-distribution datasets, which leaves the generalization ability of the models across imaging conditions and patient groups in question.
   - **Application of multi-institutional datasets**: To assess generalization, this study uses datasets from multiple institutions for external validation.
2. **Lack of comprehensive comparison with expert annotations**:
   - **Analysis of human observer variability**: Existing studies rarely compare the performance of deep learning models with that of expert human annotators, and in particular rarely analyze intra- and inter-observer variability. This makes it difficult to judge task difficulty and the true clinical value of a model.
3. **Ignoring lesion-specific metrics**:
   - **Overall segmentation accuracy vs. lesion-specific metrics**: Most existing methods focus on overall segmentation accuracy (such as the Dice similarity coefficient) and ignore lesion-specific metrics (such as lesion size and metabolic activity) that reflect clinical needs. These metrics are crucial for accurate lesion detection and disease quantification.
4. **Lack of clinical relevance**:
   - **Clinical lesion measures**: The study proposes a clinically-relevant framework for evaluating deep neural networks, so that model output can be judged against actual diagnostic requirements.
### Method overview

To address the above problems, the study adopts the following methods:

- **Multi-institutional datasets**: Use PET/CT image data of 611 cases from four different institutions, covering different lymphoma subtypes and lesion characteristics.
- **Four commonly used deep segmentation networks**: Evaluate the performance of ResUNet, SegResNet, DynUNet, and SwinUNETR.
- **Comprehensive evaluation framework**: Use standard segmentation metrics (such as the Dice similarity coefficient, DSC), and also introduce clinical lesion measures, calculate their prediction errors, and analyze the relationship between DSC performance and those lesion measures.
- **Detection criteria**: Define three detection criteria (Criteria 1, 2, and 3) to evaluate how well the networks identify and localize lesions. Criterion 3 is newly proposed and is based on metabolic characteristics.
- **Comparison with expert annotations**: Through intra- and inter-observer variability analyses, compare network performance with expert human annotators and show the similarity between network errors and human expert errors.

### Conclusion

Through extensive analysis, the study shows that:

- Deep learning networks perform better on large, metabolically active lesions.
- The error patterns of the networks closely resemble those of human experts.
- Small, faint lesions are challenging even for expert physicians and are difficult to segment consistently.

In summary, this study aims to achieve more consistent and clinically-relevant automatic lesion segmentation and to support robust decision-making in lymphoma treatment and management. The evaluation framework can be easily extended to other deep learning networks. The code has been publicly shared to promote reproducibility and further research.
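The evaluation quantities named in the method overview can be illustrated with a minimal sketch. It is not taken from the paper's repository. It assumes masks are represented as sets of voxel coordinates and that SUV values are available per voxel. The source does not list which clinical lesion measures are used, so SUVmax, SUVmean, metabolic tumor volume (MTV), and total lesion glycolysis (TLG) are chosen here only as common PET examples, and absolute error stands in for the unspecified "prediction error". All function names (`dice`, `lesion_measures`, `absolute_error`) are hypothetical.

```python
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]


def dice(gt: Set[Voxel], pred: Set[Voxel]) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(gt) + len(pred)
    if total == 0:
        # Both masks empty: scored as perfect agreement (a convention assumed here).
        return 1.0
    return 2.0 * len(gt & pred) / total


def lesion_measures(
    mask: Set[Voxel], suv: Dict[Voxel, float], voxel_ml: float
) -> Dict[str, float]:
    """Illustrative measures over a mask (assumed, not the paper's exact list)."""
    if not mask:
        return {"suv_max": 0.0, "suv_mean": 0.0, "mtv_ml": 0.0, "tlg": 0.0}
    values = [suv[v] for v in mask]
    suv_mean = sum(values) / len(values)
    mtv_ml = len(values) * voxel_ml  # segmented volume in millilitres
    return {
        "suv_max": max(values),
        "suv_mean": suv_mean,
        "mtv_ml": mtv_ml,
        "tlg": suv_mean * mtv_ml,  # TLG = SUVmean * MTV
    }


def absolute_error(pred_value: float, gt_value: float) -> float:
    """Absolute difference between a predicted and a ground-truth measure."""
    return abs(pred_value - gt_value)


if __name__ == "__main__":
    # Toy example: 4 ground-truth voxels, 3 predicted voxels, 2 shared.
    gt_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
    pred_mask = {(0, 0, 1), (0, 1, 1), (0, 2, 1)}
    suv_map = {
        (0, 0, 0): 2.0, (0, 0, 1): 4.0, (0, 1, 0): 6.0,
        (0, 1, 1): 8.0, (0, 2, 1): 1.0,
    }
    gt_m = lesion_measures(gt_mask, suv_map, voxel_ml=0.5)
    pred_m = lesion_measures(pred_mask, suv_map, voxel_ml=0.5)
    print("DSC:", dice(gt_mask, pred_mask))
    for name in gt_m:
        print(name, "error:", absolute_error(pred_m[name], gt_m[name]))
```

In practice the masks and SUV volumes would be array data loaded from the PET/CT files. The set-based representation is used only to keep the sketch dependency-free.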