Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy

Ronald L.P.D. de Jong, Yasmina al Khalil, Tim J.M. Jaspers, Romy C. van Jaarsveld, Gino M. Kuiper, Yiping Li, Richard van Hillegersberg, Jelle P. Ruurda, Marcel Breeuwer, Fons van der Sommen
2024-12-04
Abstract: Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets. We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses real-time recognition of anatomical structures in robot-assisted minimally invasive esophagectomy (RAMIE). The research focuses on the following aspects:

1. **Surgical navigation challenges**: Although RAMIE reduces surgical trauma and complications, navigation is very challenging for novice surgeons because they lose their sense of spatial orientation and have difficulty recognizing complex anatomical structures. Computer-aided anatomy recognition is expected to alleviate this problem.
2. **Insufficient datasets**: Few studies address multi-organ or multi-structure segmentation for RAMIE, and no comprehensive dataset covers multiple key anatomical structures together with surgical instruments. The authors therefore created a new RAMIE dataset of 879 frames from 32 patients, annotated with 12 classes (4 surgical instruments and 8 key anatomical structures).
3. **Model performance evaluation**: To evaluate how existing algorithms perform on the new dataset, the authors benchmarked eight real-time deep learning models, including traditional convolutional neural networks (CNNs) and attention-based networks. The models were pretrained on two datasets, ImageNet and ADE20k, to compare the effect of pretraining on semantic segmentation performance.
4. **Challenges and limitations**: The study also examines the challenges that current state-of-the-art algorithms face on the new dataset, such as class imbalance, recognition of complex structures (such as nerves), and occlusion caused by blood or other tissues. In particular, the authors hypothesized that attention-based networks better capture global patterns and handle occlusion.

### Main objectives

- **Develop a comprehensive dataset**: Create a high-quality dataset covering multiple anatomical structures and surgical instruments to support broader semantic segmentation research.
- **Evaluate the performance of different models**: Compare traditional CNNs and attention-based networks on the RAMIE dataset, focusing on whether attention-based networks handle complex scenarios better.
- **Optimize pretraining strategies**: Determine which pretraining dataset (ImageNet or ADE20k) is more effective for segmentation on the RAMIE dataset.
- **Improve the learning curve of novice surgeons**: Improve surgical navigation tools so that novice surgeons can master the RAMIE technique more quickly, reducing surgical risk.

### Conclusions

The research shows that attention-based models (such as SegNeXt and Mask2Former) perform well in semantic segmentation tasks, especially on small classes and under occlusion. ADE20k is a more effective pretraining dataset than ImageNet. Future research should explore further pretraining methods in this field to improve model performance, and should increase the amount of data for key anatomical structures such as the nerves and the thoracic duct.
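The benchmark ranks models by Dice score, and class imbalance is one of the stated challenges. As a minimal sketch of how a per-class Dice score could be computed for a predicted and a reference label map, the pure-Python example below may help. The function names `dice_per_class` and `mean_dice`, the nested-list mask format, and the choice to skip classes absent from both masks are assumptions made for illustration; they are not the authors' evaluation code.

```python
from typing import Dict, Sequence


def dice_per_class(pred: Sequence[Sequence[int]],
                   target: Sequence[Sequence[int]],
                   num_classes: int) -> Dict[int, float]:
    """Return the Dice score of every class that occurs in pred or target.

    pred and target are 2D label maps of equal shape whose entries are
    class indices in the range [0, num_classes).
    """
    intersection = [0] * num_classes   # pixels where pred == target == c
    pred_count = [0] * num_classes     # pixels predicted as class c
    target_count = [0] * num_classes   # pixels labelled as class c

    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_count[p] += 1
            target_count[t] += 1
            if p == t:
                intersection[p] += 1

    scores: Dict[int, float] = {}
    for c in range(num_classes):
        denominator = pred_count[c] + target_count[c]
        if denominator > 0:  # assumption: skip classes absent from both masks
            scores[c] = 2.0 * intersection[c] / denominator
    return scores


def mean_dice(scores: Dict[int, float]) -> float:
    """Unweighted mean over classes, so small classes count as much as large ones."""
    return sum(scores.values()) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy 3x4 label maps with three classes; one pixel of class 1 is missed.
    reference = [[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 2, 2, 2]]
    prediction = [[0, 0, 1, 1],
                  [0, 0, 0, 1],
                  [0, 2, 2, 2]]
    per_class = dice_per_class(prediction, reference, num_classes=3)
    print(per_class, mean_dice(per_class))
```

Averaging per class rather than per pixel is one way to keep rare structures such as nerves from being swamped by large classes; whether the paper averages this way is not stated in this summary.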