Co-Learning Multi-Modality PET-CT Features Via a Cascaded CNN-Transformer Network

Lei Bi,Xiaohang Fu,Qiufang Liu,Shaoli Song,David Dagan Feng,Michael Fulham,Jinman Kim
DOI: https://doi.org/10.1109/trpms.2024.3417901
2024-01-01
IEEE Transactions on Radiation and Plasma Medical Sciences
Abstract:Background: Automated segmentation of multimodality positron emission tomography-computed tomography (PET-CT) data is a major challenge in the development of computer-aided diagnosis systems (CADs). In this context, convolutional neural network (CNN)-based methods are considered as the state-of-the-art. These CNN-based methods, however, have difficulty in co-learning the complementary PET-CT image features and in learning the global context when focusing solely on local patterns. Methods: We propose a cascaded CNN-transformer network (CCNN-TN) tailored for PET-CT image segmentation. We employed a transformer network (TN) because of its ability to establish global context via self-attention and embedding image patches. We extended the TN definition by cascading multiple TNs and CNNs to learn the global and local contexts. We also introduced a hyper fusion branch that iteratively fuses the separately extracted complementary image features. We evaluated our approach, when compared to current state-of-the-art CNN methods, on three datasets: two nonsmall cell lung cancer (NSCLC) and one soft tissue sarcoma (STS). Results: Our CCNN-TN method achieved a dice similarity coefficient (DSC) score of 72.25% (NSCLC), 67.11% (NSCLC), and 66.36% (STS) for segmentation of tumors. Compared to other methods the DSC was higher for our CCNN-TN by 4.5%, 1.31%, and 3.44%. Conclusion: Our experimental results demonstrate that CCNN-TN, when compared to the existing methods, achieved more generalizable results across different datasets and has consistent performance across various image fusion strategies and network backbones.
What problem does this paper attempt to address?