Abstract: Activity recognition in surgical videos is a key research area for developing next-generation devices and workflow monitoring systems. Since surgeries are long processes with highly variable lengths, deep learning models used for surgical videos often consist of a two-stage setup using a backbone and a temporal sequence model. In this paper, we investigate many state-of-the-art backbones and temporal models to find architectures that yield the strongest performance for surgical activity recognition. We first benchmark the models' performance on a large-scale activity recognition dataset containing over 800 surgery videos captured in multiple clinical operating rooms. We further evaluate the models on two smaller public datasets, Cholec80 and Cataract-101, containing only 80 and 101 videos, respectively. We empirically found that the Swin-Transformer+BiGRU temporal model yielded strong performance on both datasets. Finally, we investigate the adaptability of the model to new domains by fine-tuning models to a new hospital and experimenting with a recent unsupervised domain adaptation approach.

What problem does this paper attempt to address?

This paper addresses **how to perform effective activity recognition in long surgical videos**. Specifically, the research explores and evaluates different deep-learning architectures (both backbone networks and temporal models) on surgical videos in order to find the model combination best suited to surgical activity recognition.

### Problem Background

Surgical videos have the following characteristics:

1. **Long video length**: A surgery may last several hours, so models must capture long-range temporal dependencies.
2. **Complexity and diversity**: Surgical procedures are complex and involve multiple activities, and the datasets are usually small and class-imbalanced.
3. **Domain specificity**: Annotating surgical videos requires professional medical knowledge, which makes data acquisition and model training harder.

### Solution

To address these problems, the authors adopt a two-stage model architecture:

1. **Backbone Model**: Extracts features from video clips. Four types of backbone networks are used in the study, drawn from 3D CNNs (such as I3D and SlowFast) and Transformer-based models (such as TimeSformer and Swin Transformer).
2. **Temporal Model**: Generates global activity predictions from the extracted features. Three types of temporal models are used in the study: CNN, RNN (such as BiGRU and UniGRU), and Transformer.

### Main Contributions

1. **Extensive model evaluation**: This is the first comprehensive evaluation of different deep-learning architectures on the surgical activity recognition task.
2. **Best model combination**: Swin Transformer + BiGRU is found to be the best two-stage combination, achieving a significant performance improvement on the Cholec80 dataset (+2.32% accuracy, +0.35% precision, +3.44% recall).
3. **Efficient model selection**: The I3D backbone is more efficient in terms of parameter count and FLOPs while achieving 98.9% of the performance of Swin Transformer.
4. **Cross-domain adaptability**: With unsupervised domain adaptation techniques, the model can generalize well to new hospital environments.

### Summary

Through extensive experiments on multiple surgical video datasets, this research verifies the superiority of the Swin Transformer + BiGRU model for surgical activity recognition and provides a useful reference for future research.
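The data flow of the two-stage setup described under Solution can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: `toy_backbone` is a hypothetical stand-in for a clip-level backbone such as I3D or Swin Transformer (here just a mean over frames), and `toy_bidirectional_temporal_model` is a hypothetical stand-in for a bidirectional recurrent model such as BiGRU (here a forward and a backward exponential moving average, with no learned gates). All names, the `clip_len` parameter, and the `decay` parameter are assumptions made for illustration.

```python
from typing import List

Frame = List[float]    # assumption: one frame flattened to a list of values
Feature = List[float]  # one feature vector per clip


def split_into_clips(frames: List[Frame], clip_len: int) -> List[List[Frame]]:
    """Cut a video of any length into consecutive clips (last one may be shorter)."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def toy_backbone(clip: List[Frame]) -> Feature:
    """Stage 1 stand-in: reduce one clip to one feature vector (mean over frames)."""
    n, dim = len(clip), len(clip[0])
    return [sum(frame[d] for frame in clip) / n for d in range(dim)]


def toy_bidirectional_temporal_model(
    features: List[Feature], decay: float = 0.5
) -> List[Feature]:
    """Stage 2 stand-in: give every clip context from both past and future clips.

    A forward scan and a backward scan each keep a running state; the two
    states are concatenated per clip, mirroring how a bidirectional recurrent
    model exposes both directions. No parameters are learned here.
    """
    if not features:
        return []

    def scan(seq: List[Feature]) -> List[Feature]:
        state = [0.0] * len(seq[0])
        outputs = []
        for x in seq:
            state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
            outputs.append(state)
        return outputs

    forward = scan(features)
    backward = scan(features[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


def two_stage_features(frames: List[Frame], clip_len: int) -> List[Feature]:
    """Run stage 1 on each clip, then stage 2 over the whole clip sequence."""
    clip_features = [toy_backbone(c) for c in split_into_clips(frames, clip_len)]
    return toy_bidirectional_temporal_model(clip_features)
```

Because stage 1 works on fixed-size clips and stage 2 works on the resulting sequence, the same code handles videos of any length; in a real system a classifier head would map each per-clip output to an activity label.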

An Empirical Study on Activity Recognition in Long Surgical Videos
