CLIP in Medical Imaging: A Comprehensive Survey

Zihao Zhao, Yuxiao Liu, Han Wu, Mei Wang, Yonghao Li, Sheng Wang, Lin Teng, Disheng Liu, Zhiming Cui, Qian Wang, Dinggang Shen
2024-08-10
Abstract: Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training paradigm, successfully introduces text supervision to vision models. It has shown promising results across various tasks, attributable to its generalizability and interpretability. The use of CLIP has recently gained increasing interest in the medical imaging domain, serving both as a pre-training paradigm for aligning medical vision and language, and as a critical component in diverse clinical tasks. With the aim of facilitating a deeper understanding of this promising direction, this survey offers an in-depth exploration of the CLIP paradigm within the domain of medical imaging, regarding both refined CLIP pre-training and CLIP-driven applications. In this study, we (1) start with a brief introduction to the fundamentals of CLIP methodology. (2) Then, we investigate the adaptation of CLIP pre-training in the medical domain, focusing on how to optimize CLIP given the characteristics of medical images and reports. (3) Furthermore, we explore the practical utilization of CLIP pre-trained models in various tasks, including classification, dense prediction, and cross-modal tasks. (4) Finally, we discuss existing limitations of CLIP in the context of medical imaging and propose forward-looking directions to address the demands of the medical imaging domain. We expect that this comprehensive survey will provide researchers in the field of medical image analysis with a holistic understanding of the CLIP paradigm and its potential implications. The project page is available at https://github.com/zhaozh10/Awesome-CLIP-in-Medical-Imaging.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper surveys the application and development of Contrastive Language-Image Pre-training (CLIP) in medical imaging. Its objectives are:

1. **Introduction to the basic principles of CLIP**: It briefly introduces the fundamentals of CLIP, a pre-training paradigm that learns interpretable visual representations through text supervision.
2. **Adapting CLIP pre-training to medical imaging**: It explores how to optimize CLIP for the characteristics of medical images and reports, particularly how to pre-train effectively on medical imaging datasets.
3. **Exploring CLIP-driven applications**: It discusses how pre-trained CLIP models can be used in various clinical tasks, such as classification, dense prediction (e.g., segmentation), and cross-modal tasks.
4. **Discussing existing limitations and future directions**: It analyzes the current limitations of CLIP in medical imaging and proposes forward-looking research directions to address them.

The paper also notes the growing trend of applying CLIP in medical imaging and how it meets the healthcare sector's demand for interpretable artificial intelligence. It additionally compares itself with related review articles and emphasizes that its distinct contribution is comprehensive coverage of both technical details and clinical applications.

In summary, the paper gives researchers a comprehensive review of the potential applications of CLIP in medical imaging, while pointing out the key challenges and development trends in this area.
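To make the idea of learning visual representations through text supervision more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It is an illustration under stated assumptions, not the survey's or any library's implementation: the function names (`l2_normalize`, `cross_entropy`, `clip_contrastive_loss`) are hypothetical, the embeddings are plain Python lists standing in for encoder outputs, and the temperature is fixed at 0.07 here, whereas in practice it is typically a learned parameter.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i are assumed to be a matched pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity between image i and text j
    sim = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each image should pick out its own report among all texts.
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-image: each report should pick out its own image among all images.
    loss_t2i = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

For example, `clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])` yields a small loss because each image is most similar to its own text, while swapping the two text embeddings yields a much larger one.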