A Survey of Medical Vision-and-Language Applications and Their Techniques

Qi Chen, Ruoshan Zhao, Sinuo Wang, Vu Minh Hieu Phan, Anton van den Hengel, Johan Verjans, Zhibin Liao, Minh-Son To, Yong Xia, Jian Chen, Yutong Xie, Qi Wu
2024-11-19
Abstract: Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration and exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and code is available at: https://github.com/YtongXie/Medical-Vision-and-Language-Tasks-and-Methodologies-A-Survey
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper surveys how medical vision-and-language models (MVLMs) are applied in the medical field and the technical challenges they face. Specifically, it focuses on the following aspects:
1. **Data Growth and Demand**: With the exponential growth of medical data, especially multimodal data, there is an urgent need for medical vision-and-language models that integrate computer vision and natural language processing and use the complementary features in the data to improve medical planning, prediction, diagnosis, and treatment.
2. **Model Capability**: MVLMs aim to provide a natural language interface for interpreting complex medical data. They can automatically extract and interpret key information in medical images and text reports to support clinical decision-making.
3. **Scope of Application**: The paper analyzes various MVLM architectures in detail, focusing on their different strategies for cross-modal integration and utilization of medical visual and textual features. The applications of these models include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval.
4. **Data Sets and Evaluation**: The paper also examines the data sets used for these tasks and compares the performance of different models based on standardized evaluation metrics.
5. **Challenges and Future Directions**: The paper points out several challenges in developing large-scale medical vision-and-language models, including difficulties in data collection, data heterogeneity, handling of unbalanced data sets, and model interpretability and credibility. It also summarizes future research trends and directions.
Through these analyses, the paper aims to provide a comprehensive review for AI researchers, clinicians and healthcare professionals, promoting interdisciplinary cooperation and the development of innovative solutions to enhance clinical practice.