Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin-based Scene Representation

Hao Ding, Yuqian Zhang, Hongchao Shu, Xu Lian, Ji Woong Kim, Axel Krieger, Mathias Unberath
2024-10-26
Abstract: Purpose: Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set, resulting in poor generalizability. Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm -- an intermediary layer to separate high-level analysis (SPR) from low-level processing (geometric understanding). This approach takes advantage of the recent vision foundation models that ensure reliable low-level scene understanding to craft DT-based scene representations that support various high-level tasks. Methods: We present a DT-based framework for SPR from videos. The framework employs vision foundation models to extract representations. We embed the representation in place of raw video inputs in the state-of-the-art Surgformer model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. Results: Contrary to the vulnerability of the baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 51.1 on the challenging CRCD dataset, 96.0 on an internal robotics training dataset, and 64.4 on a highly corrupted Cholec80 test set. Conclusion: Our findings lend support to the thesis that DT-based scene representations are effective in enhancing model robustness. Future work will seek to improve the feature informativeness, automate feature extraction, and incorporate interpretability for a more comprehensive framework.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to **improve the robustness of surgical phase recognition (SPR) models to variations in surgical videos**. End-to-end trained neural networks perform well on benchmarks but degrade on out-of-distribution (OOD) samples and corrupted images, mainly because non-causal associations in the training set lead to poor generalization. To address this, the paper proposes a framework based on a digital twin (DT) scene representation. Separating high-level analysis (such as surgical phase recognition) from low-level processing (such as geometric understanding) makes the model more robust.

### Specific Problem Description

1. **Limitations of Existing Models**:
   - End-to-end trained neural networks perform poorly on out-of-distribution samples and corrupted images.
   - These models pick up non-causal associations in the training set, which results in poor generalization.
2. **Objectives**:
   - Improve the robustness of surgical phase recognition models to video variations.
   - Use the digital twin (DT) paradigm as an intermediate layer that separates high-level analysis from low-level processing, reducing non-causal learning and enhancing robustness.

### Solution

The paper proposes a framework based on a DT scene representation, built in three steps:

1. **Representation Extraction**:
   - Use vision foundation models (such as SAM2 and DepthAnything) to extract basic representations from the original video, including segmentation masks and depth maps.
2. **DT-based Patch Embedding**:
   - Convert the extracted representations into latent representations and format them as spatio-temporal sequence embeddings suitable as input to a Transformer encoder.
3. **DT-based SPR**:
   - Train and run a state-of-the-art SPR model (such as Surgformer) on the embedded representations instead of raw video to recognize surgical phases more robustly.

### Experimental Verification

The paper evaluates the proposed method with the following experiments:

1. **OOD Generalization Experiment**:
   - Evaluate the generalization ability of the model on the Cholec80, CRCD, and internal robotics training datasets.
2. **Corruption Robustness Experiment**:
   - Apply multiple image corruptions (such as hue transformation, brightness adjustment, and contrast adjustment) to evaluate the robustness of the model.
3. **Ablation Experiment**:
   - Examine how depth information and segmentation information each affect model performance, and verify the effectiveness of the DT-based scene representation.

### Conclusion

The results show that the framework based on a DT scene representation is markedly more robust on out-of-distribution samples and corrupted images, supporting the DT paradigm as a way to make surgical data analysis models more robust. Future research will seek to make the feature representations more informative, improve the feature extraction pipeline, and explore explainable AI techniques to ease clinical translation.
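
To make the DT-based patch embedding step more concrete, here is a minimal sketch in plain Python. It rests on several assumptions that the summary does not confirm: each frame comes with a per-pixel segmentation label map and a depth map, labels are one-hot encoded, and depth is appended as one extra channel. The function name `dt_patch_tokens` and its parameters are hypothetical. The learned projection that maps each token to the Transformer's embedding width is omitted, and a real pipeline would operate on tensors rather than nested lists.

```python
from typing import List

LabelMap = List[List[int]]    # H x W segmentation class ids (assumed format)
DepthMap = List[List[float]]  # H x W depth values (assumed format)


def dt_patch_tokens(
    seg_frames: List[LabelMap],
    depth_frames: List[DepthMap],
    num_classes: int,
    patch: int,
) -> List[List[float]]:
    """Flatten per-frame segmentation and depth into a spatio-temporal token sequence.

    Each token covers one non-overlapping patch x patch window of one frame and
    concatenates, pixel by pixel, a one-hot class vector and the depth value.
    Tokens are ordered frame by frame, then row-major over patches.
    """
    if len(seg_frames) != len(depth_frames):
        raise ValueError("segmentation and depth must have the same number of frames")

    tokens: List[List[float]] = []
    for seg, depth in zip(seg_frames, depth_frames):
        height, width = len(seg), len(seg[0])
        if height % patch or width % patch:
            raise ValueError("frame size must be divisible by the patch size")

        for top in range(0, height, patch):
            for left in range(0, width, patch):
                token: List[float] = []
                for r in range(top, top + patch):
                    for c in range(left, left + patch):
                        one_hot = [0.0] * num_classes
                        one_hot[seg[r][c]] = 1.0   # class channel(s)
                        token.extend(one_hot)
                        token.append(depth[r][c])  # depth channel
                tokens.append(token)
    return tokens


if __name__ == "__main__":
    # Two tiny 2x2 frames with two classes, purely illustrative values.
    seg = [[[0, 1], [1, 0]], [[1, 1], [0, 0]]]
    depth = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
    sequence = dt_patch_tokens(seg, depth, num_classes=2, patch=2)
    print(len(sequence), len(sequence[0]))
```

Because the tokens carry only scene geometry and semantics, appearance changes such as hue, brightness, or contrast shifts reach the phase recognizer only through whatever effect they have on the upstream segmentation and depth models. That is the separation of low-level processing from high-level analysis that the DT paradigm is meant to provide.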