Deep learning for [18F]fluorodeoxyglucose-PET-CT classification in patients with lymphoma: a dual-centre retrospective analysis
Ida Häggström, Doris Leithner, Jennifer Alvén, Gabriele Campanella, Murad Abusamra, Honglei Zhang, Shalini Chhabra, Lucian Beer, Alexander Haug, Gilles Salles, Markus Raderer, Philipp B Staber, Anton Becker, Hedvig Hricak, Thomas J Fuchs, Heiko Schöder, Marius E Mayerhoefer
DOI: https://doi.org/10.1016/S2589-7500(23)00203-0
Abstract
Background: The rising global cancer burden has led to an increasing demand for imaging tests such as [18F]fluorodeoxyglucose ([18F]FDG)-PET-CT. To help imaging specialists deal with high scan volumes, we aimed to train a deep learning artificial intelligence algorithm to classify [18F]FDG-PET-CT scans of patients with lymphoma as with or without hypermetabolic tumour sites.
Methods: In this retrospective analysis, we collected 16 583 [18F]FDG-PET-CTs of 5072 patients with lymphoma who had undergone PET-CT before or after treatment at Memorial Sloan Kettering Cancer Center, New York, NY, USA. Using maximum intensity projection (MIP), three-dimensional (3D) PET, and 3D CT data, our ResNet34-based deep learning model (Lymphoma Artificial Reader System [LARS]) for binary [18F]FDG-PET-CT classification (Deauville 1-3 vs 4-5) was trained on 80% of the dataset and tested on the remaining 20%. For external testing, 1000 [18F]FDG-PET-CTs were obtained from a second centre (Medical University of Vienna, Vienna, Austria). Seven model variants were evaluated, including the MIP-based LARS-avg (optimised for accuracy) and LARS-max (optimised for sensitivity), and the 3D PET-CT-based LARS-ptct. Following expert curation, areas under the curve (AUCs), accuracies, sensitivities, and specificities were calculated.
Findings: In the internal test cohort (3325 PET-CTs, 1012 patients), LARS-avg achieved an AUC of 0·949 (95% CI 0·942-0·956), accuracy of 0·890 (0·879-0·901), sensitivity of 0·868 (0·851-0·885), and specificity of 0·913 (0·899-0·925); LARS-max achieved an AUC of 0·949 (0·942-0·956), accuracy of 0·868 (0·858-0·879), sensitivity of 0·909 (0·896-0·924), and specificity of 0·826 (0·808-0·843); and LARS-ptct achieved an AUC of 0·939 (0·930-0·948), accuracy of 0·875 (0·864-0·887), sensitivity of 0·836 (0·817-0·855), and specificity of 0·915 (0·901-0·927).
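The abstract names MIP as one of LARS's inputs but does not describe how the projections were generated. As a minimal sketch, assuming a PET volume stored as nested Python lists indexed `volume[z][y][x]` (a layout chosen here purely for illustration), a maximum intensity projection keeps the brightest voxel along one axis, and the hypothetical helper `deauville_to_label` encodes the Deauville 1-3 vs 4-5 split used as the binary target. Both function names are illustrative, not taken from the authors' code.

```python
from typing import List

# Assumed layout (illustrative, not from the paper): volume[z][y][x], with
# z = cranio-caudal slices, y = anterior-posterior rows, x = left-right columns.
Volume = List[List[List[float]]]
Image2D = List[List[float]]


def maximum_intensity_projection(volume: Volume, axis: int) -> Image2D:
    """Collapse one axis of a 3D volume by keeping the maximum voxel value along it."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:  # collapse z -> image indexed [y][x]
        return [[max(volume[z][y][x] for z in range(nz)) for x in range(nx)] for y in range(ny)]
    if axis == 1:  # collapse y (a coronal MIP under the assumed layout) -> [z][x]
        return [[max(volume[z][y][x] for y in range(ny)) for x in range(nx)] for z in range(nz)]
    if axis == 2:  # collapse x (a sagittal MIP under the assumed layout) -> [z][y]
        return [[max(volume[z][y][x] for x in range(nx)) for y in range(ny)] for z in range(nz)]
    raise ValueError("axis must be 0, 1 or 2")


def deauville_to_label(score: int) -> int:
    """Binary target from the abstract: Deauville 1-3 -> 0, Deauville 4-5 -> 1."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"Deauville score must be 1-5, got {score}")
    return int(score >= 4)


if __name__ == "__main__":
    # Toy 2 x 2 x 2 volume of uptake values.
    vol = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 0.0], [0.0, 8.0]]]
    print(maximum_intensity_projection(vol, axis=1))
    print([deauville_to_label(s) for s in range(1, 6)])
```

A real pipeline would more likely hold the volume in a NumPy array, where the same projection is `volume.max(axis=1)`; the pure-Python version only makes the per-voxel maximum explicit.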
In the external test cohort (1000 PET-CTs, 503 patients), LARS-avg achieved an AUC of 0·953 (0·938-0·966), accuracy of 0·907 (0·888-0·925), sensitivity of 0·874 (0·843-0·904), and specificity of 0·940 (0·921-0·960); LARS-max achieved an AUC of 0·952 (0·937-0·965), accuracy of 0·898 (0·878-0·916), sensitivity of 0·899 (0·871-0·926), and specificity of 0·897 (0·871-0·922); and LARS-ptct achieved an AUC of 0·932 (0·915-0·948), accuracy of 0·870 (0·850-0·891), sensitivity of 0·827 (0·793-0·863), and specificity of 0·913 (0·889-0·937).
Interpretation: Deep learning accurately distinguishes between [18F]FDG-PET-CT scans of patients with lymphoma with and without hypermetabolic tumour sites. Deep learning could therefore be useful for ruling out metabolically active disease in such patients, or as a second reader or decision support tool.
Funding: National Institutes of Health-National Cancer Institute Cancer Center Support Grant.
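The reported figures are standard binary-classification metrics with 95% CIs. The abstract does not state how the CIs were computed, so the sketch below assumes a percentile bootstrap; it also shows one generic way to choose a sensitivity-oriented operating point for rule-out use, which the abstract does not say LARS-max used. All function names are hypothetical, and label 1 stands for Deauville 4-5.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity; label 1 = Deauville 4-5 (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as P(positive score > negative score), ties counting 1/2.

    O(n_pos * n_neg): fine for a sketch, slow for thousands of scans.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold_for_sensitivity(y_true: Sequence[int], scores: Sequence[float], target: float) -> float:
    """Highest threshold t such that predicting (score >= t) reaches at least `target` sensitivity."""
    pos = sorted((s for t, s in zip(y_true, scores) if t == 1), reverse=True)
    k = max(1, math.ceil(target * len(pos)))  # positives that must be flagged
    return pos[min(k, len(pos)) - 1]


def bootstrap_ci(
    y_true: Sequence[int],
    scores: Sequence[float],
    statistic: Callable[[List[int], List[float]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI, resampling individual scans with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        if 0 < sum(t) < n:  # skip resamples lacking one class (metrics undefined)
            stats.append(statistic(t, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    print(binary_metrics(y_true, [int(s >= 0.5) for s in scores]))
    print(auc(y_true, scores), bootstrap_ci(y_true, scores, auc))
    t = threshold_for_sensitivity(y_true, scores, target=1.0)
    print(t, binary_metrics(y_true, [int(s >= t) for s in scores]))
```

Lowering the threshold to flag every positive raises sensitivity at the cost of specificity, the same trade-off visible between the LARS-avg and LARS-max figures. Because the cohorts contain several scans per patient, a patient-level (cluster) bootstrap would give more honest intervals than the scan-level resampling assumed here.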