OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He
2024-11-23
Abstract: Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address the gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations, etc.). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.
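The two mechanisms named in the abstract, contrastive alignment of paired video and text embeddings and retrieval of semantically similar silent videos, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `info_nce` and `retrieve_top_k`, the temperature of 0.07, the leading minus sign on the loss, the use of cosine similarity, and the toy 3-dimensional embeddings that stand in for encoder outputs.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Video-to-text InfoNCE over a batch; text i is the positive for video i.

    The temperature and the leading minus sign are assumptions of this
    sketch (common CLIP-style choices), not values from the paper.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    total = 0.0
    for i in range(len(v)):
        logits = [dot(v[i], t[k]) / temperature for k in range(len(t))]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += logits[i] - log_denominator  # log-probability of the positive
    return -total / len(v)


def retrieve_top_k(query_emb, silent_embs, k=2):
    """Indices of the k silent-video embeddings most similar to the query."""
    q = normalize(query_emb)
    scores = [dot(q, normalize(s)) for s in silent_embs]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy embeddings standing in for video and text encoder outputs.
clips = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned_texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
swapped_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = info_nce(clips, aligned_texts)
loss_swapped = info_nce(clips, swapped_texts)

# Hypothetical bank of silent-video embeddings to retrieve from.
silent_bank = [[0.0, 0.0, 1.0], [1.0, 0.1, 0.0], [0.8, 0.3, 0.0]]
neighbours = retrieve_top_k(clips[0], silent_bank, k=2)
```

In this toy setup the correctly paired batch yields a much smaller loss than the batch with swapped texts, and the retrieval step returns the two silent-bank entries that point in nearly the same direction as the query clip.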
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses **the challenges of ophthalmic surgical video-language pre-training: the need for complex visual interpretation, procedural skills, and advanced medical knowledge, combined with the limited availability of annotated data**. Specifically:

1. **Complex visual interpretation and procedural skills**: Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, which makes surgical vision-language pre-training (VLP) particularly challenging.
2. **Limited availability of annotated data**: Existing surgical video datasets are usually small and sparsely annotated, which makes it difficult to train deep learning models on them.
3. **Limitations of multi-modal representation learning**: In general computer vision, models such as CLIP have shown that visual concepts can be learned from natural language supervision. Surgical multi-modal representation learning still faces specific obstacles, such as specialized medical terminology and limited data volume.

To solve these problems, the authors propose OphCLIP, a hierarchical retrieval-augmented vision-language pre-training framework designed for understanding the ophthalmic surgical workflow. The main contributions of OphCLIP include:

- **Construction of the OphVL dataset**: This is the first large-scale dataset of ophthalmic surgery video-text pairs. It contains more than 375,000 hierarchically structured video-text pairs and covers multiple surgical attributes such as surgery type, phases/operations, instruments, and medications.
- **Proposing the OphCLIP framework**: By aligning short video clips with detailed narrative texts and complete videos with high-level title summaries, it enhances fine-grained and long-term visual representation learning.
In addition, a retrieval-based augmentation method is introduced, using large-scale silent surgical videos as auxiliary supervision signals to promote knowledge transfer.
- **Comprehensive zero-shot evaluation**: Extensive evaluations and ablation studies were carried out on 11 datasets, demonstrating the strong generalization ability of OphCLIP across different tasks.

### Involved formulas

1. **InfoNCE loss function (for clip-level pre-training)**:
   \[ L_{\text{clip}}^{\text{vl}}=\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(f_v(v_{ij})^\top f_t(n_{ij}))}{\sum_{k=1}^{B}\exp(f_v(v_{ij})^\top f_t(n_{kj}))} \]
   where \(B\) is the batch size, positive pairs are time-aligned video-text pairs, and all other pairs in the batch are treated as negatives.
2. **SimSiam self-supervised loss function (for further optimizing visual features)**:
   \[ L_{\text{clip}}^{\text{vv}}=\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(f_v(v_{ij})^\top f_v(\text{Aug}(v_{ij})))}{\sum_{k=1}^{B}\exp(f_v(v_{ij})^\top f_v(\text{Aug}(v_{kj})))} \]
3. **Video-level pre-training loss function**:
   \[ L_{\text{narrative}}^{\text{video}}=\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(f_v(V_i)^\top f_t(T_i))}{\sum_{k=1}^{B}\exp(f_v(V_i)^\top f_t(T_k))} \]
4. **Retrieval-based contrastive learning loss function (for knowledge transfer from silent videos)**:
   \[ L_{\text{silent}}^{\text{video}}=\frac{1}{K}\sum_{j=1}^{K}\log\frac{\exp(f_v(V_i)^\top f_v(\hat{V}_{ij}))+\exp(f_