Abstract: Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, which makes surgical vision-language pretraining (VLP) particularly challenging, especially given the limited availability of annotated data. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages OphVL, a dataset we constructed: a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different attribute combinations (surgeries, phases/operations/actions, instruments, medications, and more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. OphCLIP also introduces a retrieval-augmented pretraining framework that leverages underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.

What problem does this paper attempt to address?

The paper addresses the challenges of ophthalmic surgical video-language pretraining: **the need for complex visual interpretation, procedural skills, and advanced medical knowledge, combined with the limited availability of annotated data**. Specifically:

1. **Complex visual interpretation and procedural skills**: Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, which makes surgical vision-language pretraining (VLP) particularly challenging.
2. **Limited availability of annotated data**: Existing surgical video datasets are usually small in scale and insufficiently annotated, so they fall short of what deep learning model training requires.
3. **Limitations of multi-modal representation learning**: In general computer vision, models such as CLIP have succeeded in learning visual concepts through natural language supervision. Surgical multi-modal representation learning still faces special challenges, such as specialized medical terminology and limited data volume.

To solve these problems, the authors propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework designed specifically for understanding the ophthalmic surgical workflow.

The main contributions of OphCLIP include:

- **Construction of the OphVL dataset**: This is the first large-scale ophthalmic surgery video-text pair dataset. It contains more than 375,000 hierarchically structured video-text pairs and covers multiple surgical attributes such as surgery type, phase/operation, instruments, and medications.
- **The OphCLIP framework**: By aligning short video clips with detailed narrative texts and aligning complete videos with high-level title summaries, it enhances fine-grained and long-term visual representation learning. In addition, a retrieval-based augmentation method uses large-scale silent surgical videos as auxiliary supervision signals to promote knowledge transfer.
- **Comprehensive zero-shot evaluation**: Extensive evaluations and ablation studies were carried out on 11 datasets, demonstrating the strong generalization ability of OphCLIP across different tasks.

### Involved formulas

1. **InfoNCE loss function (for clip-level pretraining)**:

   \[ L_{\text{clip}}^{\text{vl}}=-\frac{1}{B}\sum_{i = 1}^{B}\log\frac{\exp(f_v(v_{ij})^\top f_t(n_{ij}))}{\sum_{k = 1}^{B}\exp(f_v(v_{ij})^\top f_t(n_{kj}))} \]

   where \(B\) is the batch size, \(f_v\) and \(f_t\) are the visual and text encoders, \(v_{ij}\) is a video clip, and \(n_{ij}\) is its narration text. Positive pairs are time-aligned video-text pairs, and all other pairs in the batch are treated as negatives.

2. **SimSiam self-supervised loss function (for further optimizing visual features)**:

   \[ L_{\text{clip}}^{\text{vv}}=-\frac{1}{B}\sum_{i = 1}^{B}\log\frac{\exp(f_v(v_{ij})^\top f_v(\text{Aug}(v_{ij})))}{\sum_{k = 1}^{B}\exp(f_v(v_{ij})^\top f_v(\text{Aug}(v_{kj})))} \]

   where \(\text{Aug}(\cdot)\) denotes an augmented view of a clip.

3. **Video-level pretraining loss function**:

   \[ L_{\text{narrative}}^{\text{video}}=-\frac{1}{B}\sum_{i = 1}^{B}\log\frac{\exp(f_v(V_i)^\top f_t(T_i))}{\sum_{k = 1}^{B}\exp(f_v(V_i)^\top f_t(T_k))} \]

   where \(V_i\) is a full video and \(T_i\) is its title.

4. **Retrieval-based contrastive learning loss function (for knowledge transfer from silent videos)**:

   \[ L_{\text{silent}}^{\text{video}}=-\frac{1}{K}\sum_{j = 1}^{K}\log\frac{\exp(f_v(V_i)^\top f_v(\hat{V}_{ij}))+\exp(f_
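
To make the clip-level InfoNCE loss (formula 1) concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `info_nce` and the `temperature` parameter are hypothetical, and embeddings are assumed to be precomputed outputs of the visual and text encoders. With `temperature=1.0` the function reduces to the formula as written above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=1.0):
    """Mean negative log-probability of the matched text for each video clip.

    video_embs[i] and text_embs[i] form the positive pair; every other
    text in the batch acts as a negative for clip i.
    The temperature argument is an assumption added for illustration.
    """
    batch = len(video_embs)
    total = 0.0
    for i in range(batch):
        # Similarity of clip i to every text in the batch.
        logits = [
            sum(a * b for a, b in zip(video_embs[i], text_embs[k])) / temperature
            for k in range(batch)
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        # Log-probability assigned to the positive (index i).
        total += logits[i] - log_denom
    return -total / batch


# Toy batch: two clips whose narrations point in nearly the same direction.
videos = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
texts_aligned = [l2_normalize(t) for t in [[1.0, 0.1], [0.1, 1.0]]]
texts_swapped = list(reversed(texts_aligned))

loss_aligned = info_nce(videos, texts_aligned)
loss_swapped = info_nce(videos, texts_swapped)
```

In this toy batch the correctly paired texts yield a lower loss than the swapped ones, which is the behavior the contrastive objective rewards. The video-level loss (formula 3) has the same form, with full-video and title embeddings in place of clip and narration embeddings.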

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
