Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L. Lavanchy, Pietro Mascagni, Nassir Navab, Nicolas Padoy

2024-07-23

Abstract: Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics and rely on manually annotated videos to predict fixed object categories. This limits their generalizability to unseen surgical procedures and tasks. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals for multi-modal representation learning, bypassing manual annotations. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions. We introduce SurgVLP (Surgical Vision Language Pre-training), a novel method for multi-modal representation learning. SurgVLP employs a new contrastive learning objective, aligning video clip embeddings with corresponding multiple text embeddings in a joint latent space. We demonstrate the representational capability of this space through several vision-and-language surgical tasks and vision-only tasks specific to surgery. Unlike current fully supervised approaches, SurgVLP adapts to different surgical procedures and tasks without specific fine-tuning, achieving zero-shot adaptation to tasks such as surgical tool, phase, and triplet recognition without manual annotation. These results highlight the transferability and versatility of the learned multi-modal representations in surgical video analysis. The code is available at https://github.com/CAMMA-public/SurgVLP

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address several key issues in current surgical computer vision applications:

1. **Dependency on Data Annotation**: Existing methods primarily rely on manually annotated surgical videos to predict fixed object categories, which limits their ability to generalize to unseen surgical procedures and downstream tasks.
2. **Dataset Limitations**: Current methods are mainly validated on a small number of single-center datasets covering specific surgeries, which are insufficient to cover the complexity of the entire surgical process.
3. **Underutilization of Language Information**: Existing methods do not explicitly integrate the rich semantic information of natural language text into their design, even though natural language can serve as a supervisory signal for visual models and improve their generalizability and usability across diverse tasks.

To address these issues, the authors propose a new multimodal representation learning method, SurgVLP (Surgical Vision Language Pre-training). The method leverages a large number of surgical teaching videos available on open surgical e-learning platforms, generates text transcriptions with multiple automatic speech recognition systems, and constructs a new contrastive learning objective that aligns each video clip embedding with its corresponding multiple text embeddings in a joint latent space.

This approach enables the model to adapt to different surgical procedures and tasks without task-specific fine-tuning. It demonstrates zero-shot adaptation to existing vision-only surgical downstream tasks such as surgical tool, phase, and action triplet recognition, without any manual annotation. Additionally, the study introduces several vision-language surgical tasks to evaluate the representational capability of the learned joint latent space.
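The idea of aligning one video clip embedding with several text embeddings, and then reusing the joint space for zero-shot recognition, can be illustrated with a minimal sketch. This is not the authors' implementation: the symmetric InfoNCE form, the averaging over text views, the temperature value, and all function names (`info_nce`, `multi_text_contrastive_loss`, `zero_shot_classify`) are assumptions chosen for illustration, and plain Python lists stand in for the outputs of real video and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def multi_text_contrastive_loss(video_embs, text_views, temperature=0.1):
    """Average a symmetric InfoNCE loss over several text views of the same clips.

    video_embs: one embedding per video clip.
    text_views: list of views; each view holds one text embedding per clip,
                in the same order as video_embs (assumed here: one view per
                transcription source).
    """
    videos = [l2_normalize(v) for v in video_embs]
    loss = 0.0
    for view in text_views:
        texts = [l2_normalize(t) for t in view]
        video_to_text = info_nce(videos, texts, temperature)
        text_to_video = info_nce(texts, videos, temperature)
        loss += 0.5 * (video_to_text + text_to_video)
    return loss / len(text_views)


def zero_shot_classify(video_emb, class_text_embs):
    """Return the label whose text embedding is most similar to the clip embedding."""
    v = l2_normalize(video_emb)
    scores = {label: dot(v, l2_normalize(t)) for label, t in class_text_embs.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 2-D embeddings; real ones would come from trained encoders.
    videos = [[1.0, 0.0], [0.0, 1.0]]
    view_a = [[1.0, 0.1], [0.1, 1.0]]
    view_b = [[0.9, 0.0], [0.0, 0.8]]
    print(multi_text_contrastive_loss(videos, [view_a, view_b]))
    print(zero_shot_classify([1.0, 0.2], {"grasper": [1.0, 0.0], "hook": [0.0, 1.0]}))
```

In this sketch the loss is small when each clip is closest to its own text in every view and grows when the pairing is shuffled. Zero-shot recognition then reduces to comparing a clip embedding against text embeddings of candidate class descriptions, with no task-specific training.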

Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation

VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery

HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

GP-VLS: A general-purpose vision language model for surgery

Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery

LoViT: Long Video Transformer for surgical phase recognition

Multimodal semi-supervised learning for online recognition of multi-granularity surgical workflows

VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Hierarchical Semi-Supervised Learning Framework for Surgical Gesture Segmentation and Recognition Based on Multi-Modality Data

Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition

CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Deep Multimodal Fusion for Surgical Feedback Classification

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Gesture Recognition in Robotic Surgery With Multimodal Attention

General surgery vision transformer: A video pre-trained foundation model for general surgery

SurgPETL: Parameter-Efficient Image-to-Surgical-Video Transfer Learning for Surgical Phase Recognition