Abstract: Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot-assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer-aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state-of-the-art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real-time deep learning models using two pretraining datasets. We assessed both traditional and attention-based networks, hypothesizing that attention-based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention-based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.
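The abstract reports results as Dice scores. As a rough illustration of what that metric measures, the sketch below computes a per-class Dice score, 2|P ∩ T| / (|P| + |T|), on small label grids. The function names, the toy masks, and the choice to skip classes absent from both masks are assumptions made for this example; they are not taken from the paper's evaluation code.

```python
from typing import Dict, List, Optional

Mask = List[List[int]]  # 2D grid of integer class labels (hypothetical format)


def dice_per_class(pred: Mask, target: Mask, num_classes: int) -> Dict[int, Optional[float]]:
    """Per-class Dice = 2|P ∩ T| / (|P| + |T|); None if the class is absent from both masks."""
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_count[p] += 1
            target_count[t] += 1
            if p == t:
                intersection[p] += 1
    scores: Dict[int, Optional[float]] = {}
    for c in range(num_classes):
        denom = pred_count[c] + target_count[c]
        scores[c] = 2 * intersection[c] / denom if denom else None
    return scores


def mean_dice(scores: Dict[int, Optional[float]]) -> Optional[float]:
    """Average over the classes that occur in at least one of the two masks."""
    valid = [s for s in scores.values() if s is not None]
    return sum(valid) / len(valid) if valid else None


# Toy 3x3 example with three labels in use; label 3 never occurs.
target = [[0, 0, 1],
          [0, 2, 1],
          [0, 2, 2]]
pred = [[0, 0, 1],
        [0, 2, 2],
        [0, 0, 2]]

scores = dice_per_class(pred, target, num_classes=4)
print(scores, mean_dice(scores))
```

Averaging only over classes that are present is one common convention; other conventions (for example, scoring an absent class as 1.0) would change the mean, so reported numbers are only comparable when the convention matches.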

What problem does this paper attempt to address?

This paper addresses real-time recognition of anatomical structures in robot-assisted minimally invasive esophagectomy (RAMIE). Specifically, the research focuses on the following aspects:

1. **Surgical navigation challenges**: Although RAMIE reduces surgical trauma and complications, navigation is difficult for novice surgeons, who lose their sense of spatial orientation and struggle to recognize complex anatomical structures. Computer-aided anatomy recognition is expected to help address this problem.
2. **Insufficient datasets**: Few studies address multi-organ or multi-structure segmentation for RAMIE, and no comprehensive dataset covers multiple key anatomical structures together with surgical instruments. To fill this gap, the authors created a new RAMIE dataset of 879 frames from 32 patients, annotated with 12 classes (4 surgical instruments and 8 key anatomical structures).
3. **Model performance evaluation**: To evaluate existing algorithms on the new dataset, the authors benchmarked eight real-time deep learning models, including traditional convolutional neural networks (CNNs) and attention-based networks. Each model was pretrained on one of two datasets (ImageNet or ADE20k) and then evaluated on semantic segmentation.
4. **Challenges and limitations**: The study also examines the difficulties that current state-of-the-art algorithms face on the new dataset, such as class imbalance, recognition of complex structures (such as nerves), and occlusion caused by blood or other tissues. In particular, the authors hypothesized that attention-based networks better capture global patterns and cope with occlusion.

### Main objectives

- **Develop a comprehensive dataset**: Create a high-quality dataset covering multiple anatomical structures and surgical instruments to support broader semantic segmentation research.
- **Evaluate the performance of different models**: Compare traditional CNNs and attention-based networks on the RAMIE dataset, with a focus on whether attention-based networks handle complex scenes better.
- **Compare pretraining strategies**: Determine which pretraining dataset (ImageNet or ADE20k) is more effective for segmentation on the RAMIE dataset.
- **Shorten the learning curve of novice surgeons**: By improving surgical navigation tools, help novice surgeons master the RAMIE technique more quickly and reduce surgical risk.

### Conclusions

The research shows that attention-based models (such as SegNeXt and Mask2Former) perform well in semantic segmentation, especially on small classes and under occlusion. ADE20k is a more effective pretraining dataset than ImageNet. Future research should explore additional in-domain pretraining methods to improve model performance, and should increase the amount of data for key anatomical structures such as nerves and the thoracic duct.
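Class imbalance, named above as one of the main challenges, can be quantified by counting how many pixels each class occupies across the annotated masks. The sketch below is a minimal illustration under assumed inputs: the nested-list mask format, the function names, and the toy data are hypothetical and do not reflect the RAMIE dataset's actual format or statistics.

```python
from collections import Counter
from typing import List, Optional

Mask = List[List[int]]  # 2D grid of integer class labels (hypothetical format)


def class_pixel_fractions(masks: List[Mask], num_classes: int) -> List[float]:
    """Fraction of all annotated pixels that belongs to each class."""
    counts: Counter = Counter()
    for mask in masks:
        for row in mask:
            counts.update(row)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_classes
    return [counts.get(c, 0) / total for c in range(num_classes)]


def imbalance_ratio(fractions: List[float]) -> Optional[float]:
    """Ratio of the most frequent to the least frequent class that actually occurs."""
    present = [f for f in fractions if f > 0]
    return max(present) / min(present) if present else None


# Toy data: one 2x4 mask in which class 0 dominates and class 1 is rare.
toy_masks = [[[0, 0, 0, 1],
              [0, 0, 2, 2]]]

fractions = class_pixel_fractions(toy_masks, num_classes=3)
print(fractions, imbalance_ratio(fractions))
```

A large ratio indicates that thin or small structures contribute few pixels relative to large ones, which is the situation the summary describes for structures such as nerves.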

Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy

