Event Recognition in Laparoscopic Gynecology Videos with Hybrid Transformers

Sahar Nasirihaghighi, Negin Ghamsarian, Heinrich Husslein, Klaus Schoeffmann
2023-12-01
Abstract: Analyzing laparoscopic surgery videos presents a complex and multifaceted challenge, with applications including surgical training, intra-operative surgical complication prediction, and post-operative surgical assessment. Identifying crucial events within these videos is a significant prerequisite in a majority of these applications. In this paper, we introduce a comprehensive dataset tailored for relevant event recognition in laparoscopic gynecology videos. Our dataset includes annotations for critical events associated with major intra-operative challenges and post-operative complications. To validate the precision of our annotations, we assess event recognition performance using several CNN-RNN architectures. Furthermore, we introduce and evaluate a hybrid transformer architecture coupled with a customized training-inference framework to recognize four specific events in laparoscopic surgery videos. Leveraging the Transformer networks, our proposed architecture harnesses inter-frame dependencies to counteract the adverse effects of relevant content occlusion, motion blur, and surgical scene variation, thus significantly enhancing event recognition accuracy. Moreover, we present a frame sampling strategy designed to manage variations in surgical scenes and the surgeons' skill level, resulting in event recognition with high temporal resolution. We empirically demonstrate the superiority of our proposed methodology in event recognition compared to conventional CNN-RNN architectures through a series of extensive experiments.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the issue of event recognition in laparoscopic gynecological surgery videos. Specifically, the authors focus on the following key problems:

1. **Complexity and Diversity**: Analyzing laparoscopic surgery videos is a complex and multifaceted challenge involving various application scenarios such as surgical training, intraoperative complication prediction, and postoperative evaluation. Recognizing key events in these videos is a prerequisite for most applications.
2. **Lack of Datasets**: Existing datasets are insufficient to support comprehensive recognition of key events in laparoscopic gynecological surgery videos. Therefore, the authors have constructed a comprehensive dataset specifically for this type of event recognition.
3. **Technical Limitations**: Traditional Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures have limitations when processing laparoscopic surgery videos, especially in dealing with challenges such as content occlusion, motion blur, and changes in surgical scenes. The authors propose a hybrid architecture that incorporates Transformers to improve the accuracy of event recognition.
4. **Real-time and High Resolution**: Existing methods often fail to achieve high temporal resolution in event recognition when processing surgical videos, limiting their effectiveness in practical applications. The authors propose a frame sampling strategy to manage changes in surgical scenes and the skill levels of surgeons, thereby achieving high temporal resolution in event recognition.

### Solutions

To address the aforementioned problems, the authors have taken the following measures:

1. **Dataset Construction**: The authors have constructed a dataset containing 174 laparoscopic surgery videos, each annotated by clinical experts with four key events: abdominal entry, bleeding, coagulation/cutting, and suturing. These events are related to major intraoperative challenges and postoperative complications.
2. **Proposed Hybrid Transformer Model**: The authors propose a hybrid architecture combining CNN and Transformer, utilizing the self-attention mechanism of Transformers to capture inter-frame dependencies, thereby improving the accuracy of event recognition. This model effectively addresses issues such as content occlusion, motion blur, and changes in surgical scenes.
3. **Frame Sampling Strategy**: To manage changes in surgical scenes and the skill levels of surgeons, the authors designed a frame sampling strategy to ensure high temporal resolution in event recognition.
4. **Experimental Validation**: Through a series of extensive experiments, the authors validated the superiority of the proposed hybrid Transformer model in the event recognition task, particularly excelling in recognizing the abdominal entry event.

### Conclusion

By constructing a specialized dataset and proposing a hybrid Transformer model, this paper effectively addresses the issue of event recognition in laparoscopic gynecological surgery videos. Experimental results show that the proposed method outperforms traditional CNN-RNN architectures on multiple metrics, demonstrating higher accuracy and robustness, especially when dealing with complex and diverse surgical videos.
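
The summary above names a frame sampling strategy but does not spell out its mechanics. As a rough illustration only, the Python sketch below shows one common way such a strategy could work: a video segment is split into equal sub-intervals and one frame index is drawn at random from each, so the sampled frames span the whole segment regardless of how quickly the surgeon works. The function name `sample_segment_frames`, its parameters, and the one-frame-per-sub-interval rule are all assumptions made for this example and are not taken from the paper.

```python
import random


def sample_segment_frames(start, end, num_frames, rng=None):
    """Hypothetical helper (not from the paper): draw one random frame
    index from each of `num_frames` equal sub-intervals of [start, end)."""
    rng = rng or random.Random()
    length = end - start
    if num_frames <= 0 or length < num_frames:
        raise ValueError("segment must contain at least num_frames frames")

    indices = []
    for k in range(num_frames):
        # Integer bounds of the k-th sub-interval; hi is exclusive.
        lo = start + (k * length) // num_frames
        hi = start + ((k + 1) * length) // num_frames
        indices.append(rng.randrange(lo, hi))
    return indices


# Example: choose 10 frames from a 90-frame segment (assumed sizes).
frame_ids = sample_segment_frames(0, 90, 10, rng=random.Random(0))
print(frame_ids)
```

Under these assumptions, calling the function again with a different random state yields different frames from the same segment, which is one plausible way to expose a model to variation in scene content and surgical speed during training.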