Fully Automated Region-Specific Human-Perceptive-Equivalent Image Quality Assessment: Application to 18F-FDG PET Scans
Mehdi Amini, Yazdan Salimi, Ghasem Hajianfar, Ismini Mainta, Elsa Hervier, Amirhossein Sanaat, Arman Rahmim, Isaac Shiri, Habib Zaidi
DOI: https://doi.org/10.1097/RLU.0000000000005526
2024-12-01
Abstract: Introduction: We propose a fully automated framework to conduct a region-wise image quality assessment (IQA) on whole-body 18F-FDG PET scans. This framework (1) can be valuable in daily clinical image acquisition procedures to instantly recognize low-quality scans for potential rescanning and/or image reconstruction, and (2) can make a significant impact in dataset collection for the development of artificial intelligence-driven 18F-FDG PET analysis models by rejecting low-quality images and those presenting with artifacts, toward building clean datasets. Patients and methods: Two experienced nuclear medicine physicians separately evaluated the quality of 174 18F-FDG PET images from 87 patients, for each body region, based on a 5-point Likert scale. The body regions included the following: (1) the head and neck, including the brain, (2) the chest, (3) the chest-abdomen interval (diaphragmatic region), (4) the abdomen, and (5) the pelvis. Intrareader and interreader reproducibility of the quality scores were calculated using 39 randomly selected scans from the dataset. For binary classification, images were dichotomized into low quality (physician quality score ≤3) versus high quality (score >3). Taking 18F-FDG PET/CT scans as input, our proposed fully automated framework applies 2 deep learning (DL) models to the CT images to perform region identification and whole-body contour extraction (excluding extremities), then classifies each PET region as low or high quality. For classification, 2 mainstream artificial intelligence-driven approaches were investigated: machine learning (ML) from radiomic features, and DL. All models were trained and evaluated on the scores attributed by each physician and on the average of the reported scores.
Performance was evaluated on the same test dataset for the radiomics-ML and DL models using the area under the curve, accuracy, sensitivity, and specificity, and the models were compared using the DeLong test, with P values <0.05 regarded as statistically significant. Results: In the head and neck, chest, chest-abdomen interval, abdomen, and pelvis regions, the best models achieved area under the curve, accuracy, sensitivity, and specificity of [0.97, 0.95, 0.96, and 0.95], [0.85, 0.82, 0.87, and 0.76], [0.83, 0.76, 0.68, and 0.80], [0.73, 0.72, 0.64, and 0.77], and [0.72, 0.68, 0.70, and 0.67], respectively. In all regions, models achieved their highest performance when developed on the quality scores with higher intrareader reproducibility. Comparison of DL and radiomics-ML models did not show any statistically significant differences, although DL models showed an overall trend toward better performance. Conclusions: We developed a fully automated and human-perceptive equivalent model to conduct region-wise IQA over 18F-FDG PET images. Our analysis emphasizes the necessity of developing separate models for body regions and performing data annotation based on multiple experts' consensus in IQA studies.
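
The score dichotomization and the four reported metrics can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, and the choice of high quality as the positive class (label 1), the 0.5 decision cutoff in the usage note, and the rank-based (Mann-Whitney) AUC estimate are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def binarize_likert(scores: Sequence[float], threshold: float = 3) -> List[int]:
    """Dichotomize 5-point Likert scores: <= threshold -> 0 (low quality), > threshold -> 1 (high quality)."""
    return [1 if s > threshold else 0 for s in scores]


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity, with label 1 as the positive class (an assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    nan = float("nan")  # returned when a ratio is undefined (empty denominator)
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly, ties counted as 0.5."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined when one class is absent
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a usage note, physician scores for one body region would first pass through `binarize_likert`, a model's predicted probabilities would be thresholded (for example at 0.5) to obtain `y_pred`, and `classification_metrics` and `auc` would then be computed per region on the held-out test set.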