Abstract: Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration/exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and code is available at: https://github.com/YtongXie/Medical-Vision-and-Language-Tasks-and-Methodologies-A-Survey

What problem does this paper attempt to address?

This paper addresses how Medical Vision-and-Language Models (MVLMs) are applied in the medical field and the technical challenges they face. Specifically, it focuses on the following aspects:

1. **Data Growth and Demand**: With the exponential growth of medical data, especially multi-modal data, there is an urgent need for medical vision-and-language models that integrate computer vision and natural language processing and exploit the complementary features in the data to improve medical planning, prediction, diagnosis, and treatment.
2. **Model Capability**: MVLMs aim to provide a natural language interface for interpreting complex medical data. They can automatically extract and interpret key information in medical images and text reports to support clinical decision-making.
3. **Scope of Application**: The paper analyzes in detail various MVLM architectures, focusing on their different strategies for cross-modal integration and utilization of medical visual and textual features. The applications of these models include automated medical report generation, medical visual question answering, medical multi-modal segmentation, diagnosis and prognosis, and medical image-text retrieval.
4. **Data Sets and Evaluation**: The paper also examines the data sets used for these tasks and compares the performance of different models based on standardized evaluation metrics.
5. **Challenges and Future Directions**: The paper points out several challenges in developing large-scale medical vision-and-language models, including difficulties in data collection, data heterogeneity, handling of imbalanced data sets, and model interpretability and credibility. It also summarizes future research trends and directions.

Through these analyses, the paper aims to provide a comprehensive review for AI researchers, clinicians, and healthcare professionals, promoting interdisciplinary cooperation and the development of innovative solutions to enhance clinical practice.

A Survey of Medical Vision-and-Language Applications and Their Techniques

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Visual–language Foundation Models in Medicine

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

Medical Vision-Language Pre-Training for Brain Abnormalities

On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Vision-Language Models for Vision Tasks: A Survey

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Vision language models in ophthalmology

Visual Question Answering in Ophthalmology: A Progressive and Practical Perspective

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

A survey on advancements in image-text multimodal models: From general techniques to biomedical implementations

Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI