Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L. Lavanchy, Pietro Mascagni, Nassir Navab, Nicolas Padoy
2024-07-23
Abstract: Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics, relying on manually annotated videos to predict fixed object categories. This limits their generalizability to unseen surgical procedures and tasks. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals for multi-modal representation learning, bypassing manual annotations. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions. We introduce SurgVLP (Surgical Vision Language Pre-training), a novel method for multi-modal representation learning. SurgVLP employs a new contrastive learning objective, aligning video clip embeddings with corresponding multiple text embeddings in a joint latent space. We demonstrate the representational capability of this space through several vision-and-language surgical tasks and vision-only tasks specific to surgery. Unlike current fully supervised approaches, SurgVLP adapts to different surgical procedures and tasks without specific fine-tuning, achieving zero-shot adaptation to tasks such as surgical tool, phase, and triplet recognition without manual annotation. These results highlight the transferability and versatility of the learned multi-modal representations in surgical video analysis. The code is available at https://github.com/CAMMA-public/SurgVLP
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses three limitations of current surgical computer vision methods:

1. **Dependency on Data Annotation**: Existing methods rely primarily on manually annotated surgical videos to predict fixed object categories, which limits their ability to generalize to unseen surgical procedures and downstream tasks.
2. **Dataset Limitations**: Current methods are validated mainly on a small number of single-center, procedure-specific datasets, which are insufficient to cover the complexity of the entire surgical process.
3. **Underutilization of Language Information**: Existing methods do not explicitly integrate the rich semantic information of natural language text into their design, even though natural language can serve as a supervisory signal that makes visual models more generalizable and usable across diverse tasks.

To address these issues, the authors propose SurgVLP (Surgical Vision Language Pre-training), a new multi-modal representation learning method. It leverages a large number of surgical teaching videos available on open surgical e-learning platforms, generates text transcriptions with multiple automatic speech recognition systems, and uses a new contrastive learning objective that aligns each video clip embedding with its corresponding multiple text embeddings in a joint latent space. This enables the model to adapt to different surgical procedures and tasks without task-specific fine-tuning, demonstrating zero-shot adaptation to current vision-only surgical downstream tasks such as surgical tool, phase, and action triplet recognition, without any manual annotation. The study also introduces several vision-language surgical tasks to evaluate the representational capability of the learned joint latent space.
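
The idea of aligning one video clip with several text embeddings, then reusing the joint space for zero-shot recognition, can be illustrated with a small sketch. This is not the SurgVLP implementation: it assumes a generic symmetric InfoNCE loss averaged over text views (for example, transcripts from two different speech recognition systems), and the function names, the temperature value, and the toy 2-D embeddings are all hypothetical choices made for illustration. The paper's actual objective may differ in its details.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def multi_text_contrastive_loss(video_emb, text_views, temperature=0.1):
    """Average symmetric InfoNCE over several text views of the same clips.

    text_views is a list of embedding lists, one per text source
    (assumed here: one per speech recognition system).
    """
    loss = 0.0
    for text_emb in text_views:
        loss += info_nce(video_emb, text_emb, temperature)  # video -> text
        loss += info_nce(text_emb, video_emb, temperature)  # text -> video
    return loss / (2 * len(text_views))

def zero_shot_predict(video_vec, prompt_embs):
    """Return the index of the class prompt closest to the clip embedding."""
    v = normalize(video_vec)
    sims = [dot(v, normalize(p)) for p in prompt_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy batch of two clips, each paired with two text views (made-up numbers).
video = [[1.0, 0.0], [0.0, 1.0]]
text_a = [[0.9, 0.1], [0.1, 0.9]]
text_b = [[0.8, 0.2], [0.2, 0.8]]

aligned = multi_text_contrastive_loss(video, [text_a, text_b])
shuffled = multi_text_contrastive_loss(video, [text_a[::-1], text_b[::-1]])
print("aligned loss:", aligned)
print("shuffled loss:", shuffled)

# Zero-shot use: embed one text prompt per class, pick the nearest one.
print("predicted class:", zero_shot_predict([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch, correctly paired clips and texts yield a lower loss than shuffled pairs, which is the signal that pulls matching video and text embeddings together. Zero-shot recognition then needs no labelled videos: class names (tools, phases, triplets) are written as text prompts, embedded, and compared with the clip embedding.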