Abstract: Objective: Surgical activity recognition is a fundamental step in computer-assisted interventions. This paper reviews the state of the art in methods for automatic recognition of fine-grained gestures in robotic surgery, focusing on recent data-driven approaches, and outlines open questions and future research directions. Methods: An article search was performed on 5 bibliographic databases with the following search terms: robotic, robot-assisted, JIGSAWS, surgery, surgical, gesture, fine-grained, surgeme, action, trajectory, segmentation, recognition, parsing. Selected articles were classified based on the level of supervision required for training and divided into groups representing major frameworks for time series analysis and data modelling. Results: A total of 52 articles were reviewed. The research field is expanding rapidly, with the majority of articles published in the last 4 years. Deep-learning-based temporal models with discriminative feature extraction and multi-modal data integration have demonstrated promising results on small surgical datasets. Currently, unsupervised methods perform significantly worse than supervised approaches. Conclusion: The development of large and diverse open-source datasets of annotated demonstrations is essential for the development and validation of robust solutions for surgical gesture recognition. While new strategies for discriminative feature extraction and knowledge transfer, or unsupervised and semi-supervised approaches, can mitigate the need for data and labels, they have not yet been demonstrated to achieve comparable performance. Important future research directions include the detection and forecasting of gesture-specific errors and anomalies. Significance: This paper is a comprehensive and structured analysis of surgical gesture recognition methods that aims to summarize the status of this rapidly evolving field.

Zero-shot Prompt-based Video Encoder for Surgical Gesture Recognition

Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

Text Promptable Surgical Instrument Segmentation with Vision-Language Models

Towards Accurate and Interpretable Surgical Skill Assessment: A Video-Based Method Incorporating Recognized Surgical Gestures and Skill Levels

Using hand pose estimation to automate open surgery training feedback

Towards Accurate and Interpretable Surgical Skill Assessment: a Video-Based Method for Skill Score Prediction and Guiding Feedback Generation

Gesture Recognition in Robotic Surgery With Multimodal Attention

Surgical gesture classification from video and kinematic data

Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos

VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery

SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation

Prompt-based Zero-shot Video Moment Retrieval

HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

Quantification of Robotic Surgeries with Vision-Based Deep Learning

Gesture Recognition in Robotic Surgery: A Review

Surgical Phase Recognition in Inguinal Hernia Repair—AI-Based Confirmatory Baseline and Exploration of Competitive Models

General surgery vision transformer: A video pre-trained foundation model for general surgery

Dual modality prompt learning for visual question-grounded answering in robotic surgery

Surgical gesture classification from video data

Using 3D Convolutional Neural Networks to Learn Spatiotemporal Features for Automatic Surgical Gesture Recognition in Video