Abstract: Medical vision-and-language models (MVLMs) have attracted substantial interest because they offer a natural language interface for interpreting complex medical data. Their applications are versatile: they have the potential to improve diagnostic accuracy and decision-making for individual patients, while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of those images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration and exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and code is available at: https://github.com/YtongXie/Medical-Vision-and-Language-Tasks-and-Methodologies-A-Survey

ViMQ: A Vietnamese Medical Question Dataset for Healthcare Dialogue System Development

SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts

Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts

VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain

New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles

Improving Vietnamese-English Medical Machine Translation

TM-PATHVQA: 90000+ Textless Multilingual Questions for Medical Visual Question Answering

A Vietnamese Dataset for Evaluating Machine Reading Comprehension

A dataset for medical instructional video classification and question answering

Real-time Speech Summarization for Medical Conversations

Huatuo-26M, a Large-scale Chinese Medical QA Dataset

BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering

TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

Medical visual question answering: A survey

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges

MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

The Hmong Medical Corpus: a biomedical corpus for a minority language

Medical Spoken Named Entity Recognition

A Survey of Medical Vision-and-Language Applications and Their Techniques