Abstract: Purpose: Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set, resulting in poor generalizability. Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm -- an intermediary layer to separate high-level analysis (SPR) from low-level processing (geometric understanding). This approach takes advantage of the recent vision foundation models that ensure reliable low-level scene understanding to craft DT-based scene representations that support various high-level tasks. Methods: We present a DT-based framework for SPR from videos. The framework employs vision foundation models to extract representations. We embed the representation in place of raw video inputs in the state-of-the-art Surgformer model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. Results: Contrary to the vulnerability of the baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 51.1 on the challenging CRCD dataset, 96.0 on an internal robotics training dataset, and 64.4 on a highly corrupted Cholec80 test set. Conclusion: Our findings lend support to the thesis that DT-based scene representations are effective in enhancing model robustness. Future work will seek to improve the feature informativeness, automate feature extraction, and incorporate interpretability for a more comprehensive framework.
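The abstract reports results as video-level accuracy. As a minimal sketch, assuming the common convention that accuracy is computed per video and then averaged so that every video counts equally regardless of its length, the metric could be computed as follows. The function names and the phase sequences are hypothetical and only illustrate the arithmetic; they are not taken from the paper.

```python
from statistics import mean


def frame_accuracy(predicted, reference):
    """Percentage of frames in one video whose predicted phase matches the label."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("need two non-empty phase sequences of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


def video_level_accuracy(videos):
    """Average the per-video accuracies so each video counts equally."""
    return mean(frame_accuracy(pred, ref) for pred, ref in videos)


# Hypothetical phase indices for two short videos (illustrative data only).
videos = [
    ([0, 0, 1, 1], [0, 0, 1, 2]),
    ([0, 1, 1, 2, 2, 2, 2, 2], [0, 1, 1, 2, 2, 2, 2, 2]),
]
print(video_level_accuracy(videos))
```

Under this assumption a short video weighs as much as a long one, so the result can differ from the accuracy obtained by pooling all frames across videos.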

What problem does this paper attempt to address?

The paper aims to **improve the robustness of surgical phase recognition (SPR) models to variations in surgical videos**. End-to-end trained neural networks perform well on benchmarks but poorly on out-of-distribution (OOD) samples and corrupted images, mainly because non-causal associations in the training set limit their ability to generalize. To address this, the paper proposes a framework built on digital twin (DT) scene representations. Separating high-level analysis (such as surgical phase recognition) from low-level processing (such as geometric understanding) is intended to make the model more robust.

### Specific Problem Description

1. **Limitations of Existing Models**:
   - End-to-end trained neural networks perform poorly on out-of-distribution samples and corrupted images.
   - These models are easily affected by non-causal associations in the training set, which results in poor generalization.
2. **Objectives**:
   - Improve the robustness of surgical phase recognition models to video variations.
   - Use the digital twin (DT) paradigm to build an intermediate layer that separates high-level analysis from low-level processing, reducing the learning of non-causal associations.

### Solution

The paper proposes a framework based on DT scene representations, built in three steps:

1. **Representation Extraction**:
   - Use vision foundation models (such as SAM2 and DepthAnything) to extract basic representations from the original video, including segmentation masks and depth maps.
2. **DT-based Patch Embedding**:
   - Convert the extracted representations into latent representations and format them as spatio-temporal sequence embeddings suitable as input to a Transformer encoder.
3. **DT-based SPR**:
   - Train and run a state-of-the-art SPR model (such as Surgformer) on the embedded representations instead of raw video, to obtain surgical phase recognition with enhanced robustness.

### Experimental Verification

The paper evaluates the proposed method in three experiments:

1. **OOD Generalization Experiment**:
   - Evaluate the generalization ability of the model on the Cholec80 dataset, the CRCD dataset, and an internal robotics training dataset.
2. **Corruption Robustness Experiment**:
   - Apply several image corruptions (such as hue shifts, brightness adjustment, and contrast adjustment) to evaluate the robustness of the model.
3. **Ablation Experiment**:
   - Examine how depth information and segmentation information each affect model performance, and verify the effectiveness of the DT-based scene representation.

### Conclusion

The results show that the framework based on DT scene representations is markedly more robust on out-of-distribution samples and corrupted images, which supports the DT paradigm as a way to make surgical data analysis models more robust. Future work will increase the informativeness of the feature representations, improve the feature extraction pipeline, and explore explainable AI techniques to make clinical translation more feasible.
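The representation extraction and patch embedding steps described under Solution can be illustrated with a minimal sketch. It assumes that each frame's segmentation mask and depth map are stacked as two channels and cut into non-overlapping square patches, one flattened token per patch, ordered by frame and then by position. The helper names `dt_frame` and `patch_tokens` are hypothetical, the toy mask and depth values stand in for the outputs of the foundation models, and the learned projection into the latent space is omitted; the paper's actual embedding may differ.

```python
def dt_frame(mask, depth):
    """Stack a segmentation mask and a depth map (both H x W) as two channels."""
    if len(mask) != len(depth) or any(len(m) != len(d) for m, d in zip(mask, depth)):
        raise ValueError("mask and depth map must have the same shape")
    return [mask, depth]


def patch_tokens(frames, patch):
    """Turn T frames (each C x H x W) into a flat spatio-temporal token sequence.

    Each token is one non-overlapping patch flattened to C * patch * patch values.
    """
    tokens = []
    for frame in frames:  # temporal axis
        height, width = len(frame[0]), len(frame[0][0])
        if height % patch or width % patch:
            raise ValueError("frame size must be divisible by the patch size")
        for top in range(0, height, patch):  # spatial axes
            for left in range(0, width, patch):
                tokens.append([
                    channel[y][x]
                    for channel in frame
                    for y in range(top, top + patch)
                    for x in range(left, left + patch)
                ])
    return tokens


# Toy 4 x 4 inputs standing in for foundation-model outputs (illustrative only).
mask = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
depth = [[0.5] * 4 for _ in range(4)]
frames = [dt_frame(mask, depth), dt_frame(mask, depth)]

tokens = patch_tokens(frames, patch=2)
print(len(tokens), len(tokens[0]))
```

In a full pipeline each token would then pass through a learned linear layer before reaching the Transformer encoder; that layer is left out here to keep the sketch dependency-free.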

Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin-based Scene Representation
