An Empirical Study on Activity Recognition in Long Surgical Videos

Zhuohong He, Ali Mottaghi, Aidean Sharghi, Muhammad Abdullah Jamal, Omid Mohareri
DOI: https://doi.org/10.48550/arXiv.2205.02805
2022-09-07
Abstract: Activity recognition in surgical videos is a key research area for developing next-generation devices and workflow monitoring systems. Since surgeries are long processes with highly variable lengths, deep learning models used for surgical videos often consist of a two-stage setup using a backbone and a temporal sequence model. In this paper, we investigate many state-of-the-art backbones and temporal models to find architectures that yield the strongest performance for surgical activity recognition. We first benchmark the models' performance on a large-scale activity recognition dataset containing over 800 surgery videos captured in multiple clinical operating rooms. We further evaluate the models on two smaller public datasets, Cholec80 and Cataract-101, containing only 80 and 101 videos respectively. We empirically found that the Swin-Transformer+BiGRU temporal model yielded strong performance on both datasets. Finally, we investigate the adaptability of the model to new domains by fine-tuning models to a new hospital and experimenting with a recent unsupervised domain adaptation approach.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the question of **how to perform effective activity recognition in long surgical videos**. Specifically, it evaluates different deep learning architectures (backbone networks paired with temporal models) on surgical videos to find the combination best suited to surgical activity recognition.

### Problem Background

Surgical videos have the following characteristics:

1. **Long video length**: A surgery may last several hours, so models must capture long-range temporal dependencies.
2. **Complexity and diversity**: Procedures are complex and involve multiple activities, and datasets are usually small and class-imbalanced.
3. **Domain specificity**: Annotation requires professional medical knowledge, which makes data acquisition and model training harder.

### Solution

To address these problems, the authors adopt a two-stage architecture:

1. **Backbone model**: Extracts features from video clips. The study evaluates four backbones from two families: 3D CNNs (I3D, SlowFast) and Transformer-based models (TimeSformer, Swin Transformer).
2. **Temporal model**: Produces activity predictions over the whole video from the extracted clip features. Three types of temporal model are studied: CNN, RNN (such as BiGRU and UniGRU), and Transformer.

### Main Contributions

1. **Extensive model evaluation**: This is the first comprehensive evaluation of different deep learning architectures on the surgical activity recognition task.
2. **Best model combination**: Swin Transformer + BiGRU is found to be the best two-stage combination, achieving a performance improvement on the Cholec80 dataset (+2.32% accuracy, +0.35% precision, +3.44% recall).
3. **Efficient model selection**: The I3D backbone is more efficient in parameter count and FLOPs while achieving 98.9% of the performance of Swin Transformer.
4. **Cross-domain adaptability**: With unsupervised domain adaptation techniques, the model can generalize well to new hospital environments.

### Summary

Through extensive experiments on multiple surgical video datasets, the study shows that the Swin Transformer + BiGRU model performs strongly on surgical activity recognition and provides a useful reference for future research.
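
The two-stage data flow described above (clip-level features from a backbone, then a bidirectional GRU over the sequence of clips) can be sketched in plain Python. This is a toy illustration under loud assumptions, not the authors' implementation: `backbone_stub` is a hypothetical stand-in that averages frame vectors rather than running I3D or Swin Transformer, the weights are random and untrained, and the clip length, feature size, hidden size, and number of classes are arbitrary. The names `split_into_clips`, `GRUCell`, and `bigru_predict` are likewise made up for this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def split_into_clips(num_frames, clip_len):
    # (start, end) frame index pairs covering the video; the last clip may be shorter.
    return [(s, min(s + clip_len, num_frames)) for s in range(0, num_frames, clip_len)]

def backbone_stub(frames):
    # Hypothetical stage-1 stand-in: average the frame vectors into one clip feature.
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

class GRUCell:
    # One common GRU formulation, without bias terms (assumption for brevity).
    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # Input (W) and recurrent (U) weights for the update (z), reset (r),
        # and candidate (n) computations.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Blend the previous state and the candidate state using the update gate.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, seq):
        h = [0.0] * self.hid_dim
        states = []
        for x in seq:
            h = self.step(x, h)
            states.append(h)
        return states

def bigru_predict(clip_feats, fwd, bwd, W_out):
    # Stage 2: run one GRU forward and one backward over the clip sequence,
    # concatenate their states per clip, and pick the highest-scoring class.
    f_states = fwd.run(clip_feats)
    b_states = bwd.run(clip_feats[::-1])[::-1]
    preds = []
    for hf, hb in zip(f_states, b_states):
        logits = matvec(W_out, hf + hb)
        preds.append(max(range(len(logits)), key=logits.__getitem__))
    return preds

rng = random.Random(0)
video = [[rng.random() for _ in range(8)] for _ in range(50)]  # 50 toy frames, 8 values each
clips = split_into_clips(len(video), clip_len=16)
feats = [backbone_stub(video[s:e]) for s, e in clips]          # stage 1: one feature per clip
fwd, bwd = GRUCell(8, 4, rng), GRUCell(8, 4, rng)
W_out = rand_matrix(3, 8, rng)                                  # 3 hypothetical activity classes
preds = bigru_predict(feats, fwd, bwd, W_out)                   # stage 2: one label per clip
print(clips, preds)
```

Because the backward GRU sees later clips before earlier ones, each per-clip prediction depends on the whole video. That is the property that separates a BiGRU from a unidirectional GRU (UniGRU), which can only use past context.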