Abstract: Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely-used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models' performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.

What problem does this paper attempt to address?

This paper systematically evaluates how sensitive three mainstream Medical Vision-Language Pre-training (MedVLP) models are to different text prompts in zero-shot classification tasks. Specifically, the authors focus on the following issues:

1. **Sensitivity of models to different text prompt styles**
   - Current MedVLP models perform inconsistently when given different styles of text prompts. Ideally, a MedVLP model should produce consistent results for each disease category regardless of the prompt style (e.g., a bare disease name versus a detailed description). Existing research has not fully explored this sensitivity.
2. **Ability to understand complex medical concepts**
   - Existing MedVLP models struggle with complex medical concepts. Their performance changes as the interpretability of the prompts increases, which indicates difficulty in understanding complex medical terms and descriptions.
3. **Zero-shot inference ability**
   - For unseen disease categories, a MedVLP model should be able to exploit detailed, highly interpretable text prompts to improve prediction accuracy. How well existing models can do this is unclear.

To evaluate these issues, the authors designed six styles of text prompts and ran experiments on three publicly available benchmark datasets. The six prompt styles are: disease names, symptom descriptions, attribute descriptions, general English descriptions, radiologist-style descriptions, and medical-style descriptions. Through these experiments, the authors aim to reveal the limitations of current MedVLP models and to offer suggestions for future research.

### Main findings

1. **Performance fluctuations**
   - All evaluated MedVLP models show significant performance fluctuations across prompt styles, indicating that they lack robustness to diverse prompt styles.
2. **Understanding of complex medical concepts**
   - The models have difficulty with complex medical concepts: performance drops significantly as the interpretability of the prompts increases.
3. **Zero-shot inference ability**
   - Only some models (such as MedKLIP) are able to make use of highly interpretable prompts for unseen disease categories, while other models (such as BioViL and KAD) perform relatively poorly in this regard.

### Conclusions and suggestions

Based on the above findings, the authors propose the following suggestions for improving MedVLP models:

- **Incorporate domain-knowledge enhancement methods**: Use external knowledge bases, such as UMLS, to inject medical-domain knowledge into the models and improve zero-shot diagnosis performance.
- **Use information-rich texts for pre-training**: The pre-training stage should include more descriptive, highly interpretable text so that the models can make better use of this information at inference time.
- **Ensure the diversity of text styles in the pre-training dataset**: The pre-training dataset should cover a range of text styles, from simple disease names to detailed descriptions, to improve the adaptability and robustness of the models.

These measures are expected to improve the performance and stability of MedVLP models when handling diverse zero-shot prompts.
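The kind of prompt-sensitivity evaluation summarized above can be illustrated with a minimal sketch. It assumes a CLIP-style setup in which an image is assigned to the class whose prompt embedding has the highest cosine similarity; the paper's actual models, encoders, and metrics are not specified here. The function names, the style identifiers, and the synthetic embeddings in the demo are all hypothetical stand-ins, not the authors' code or data.

```python
import numpy as np

# Hypothetical identifiers for the six prompt styles listed in the summary.
PROMPT_STYLES = [
    "disease_name", "symptom", "attribute",
    "general_english", "radiologist", "medical",
]


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.clip(norm, 1e-12, None)


def zero_shot_predict(image_emb, text_emb):
    """Pick, for each image, the class whose prompt embedding is most similar.

    image_emb: (n_images, d) array; text_emb: (n_classes, d) array.
    """
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)


def accuracy_per_style(image_emb, labels, text_emb_by_style):
    """Zero-shot accuracy for each prompt style, with the image embeddings held fixed."""
    return {
        style: float((zero_shot_predict(image_emb, emb) == labels).mean())
        for style, emb in text_emb_by_style.items()
    }


def sensitivity(acc_by_style):
    """Summarize how much accuracy moves across prompt styles."""
    values = np.array(list(acc_by_style.values()))
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),
        "range": float(values.max() - values.min()),
    }


if __name__ == "__main__":
    # Synthetic demo only: random vectors stand in for real encoder outputs,
    # and the per-style noise level is an arbitrary assumption.
    rng = np.random.default_rng(0)
    n_classes, dim, per_class = 15, 32, 20
    prototypes = rng.normal(size=(n_classes, dim))
    labels = np.repeat(np.arange(n_classes), per_class)
    images = prototypes[labels] + 0.5 * rng.normal(size=(labels.size, dim))
    text_by_style = {
        style: prototypes + 0.4 * i * rng.normal(size=prototypes.shape)
        for i, style in enumerate(PROMPT_STYLES)
    }
    acc = accuracy_per_style(images, labels, text_by_style)
    print(acc)
    print(sensitivity(acc))
```

In this framing, a robust model would show a small `std` and `range` across styles. With a real MedVLP model, the random vectors would be replaced by the outputs of its image and text encoders, and accuracy could be swapped for whichever metric the benchmark uses.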

How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?

Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Exploring low-resource medical image classification with weakly supervised prompt learning

Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts

Pseudo-Prompt Generating in Pre-trained Vision-Language Models for Multi-Label Medical Image Classification

MCPL: Multi-modal Collaborative Prompt Learning for Medical Vision-Language Model

Improving Medical Vision-Language Contrastive Pretraining with Semantics-aware Triage

Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization

Aligning Medical Images with General Knowledge from Large Language Models

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Targeted Visual Prompting for Medical Visual Question Answering

Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt

VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine