VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen

2024-10-16

Abstract: Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key challenges in applying existing Vision Language Models (VLMs) to the medical field:

1. **Single Visual Grounding Method**: Most existing VLMs rely on a single method of visual grounding, whereas complex medical tasks require more versatile approaches.
2. **Support for Only 2D Images**: Existing VLMs primarily handle 2D images, but a significant portion of medical images are 3D.
3. **Data Scarcity**: The lack of medical data further compounds these challenges.

To tackle these challenges, the authors propose VividMed, a medical vision language model with versatile visual grounding capabilities. VividMed can generate semantic segmentation masks and instance-level bounding boxes, and it supports multiple imaging modalities, including 2D and 3D data. VividMed also performs well on other common downstream tasks such as Visual Question Answering (VQA) and report generation.

### Main Contributions

1. **Proposing VividMed**: An exploratory attempt to endow medical VLMs with versatile visual grounding capabilities, enabling report generation and other tasks that build on visual grounding.
2. **Designing a Three-Stage Training Process**: The authors design a three-stage training procedure and, to address data scarcity, an automatic data synthesis pipeline. All datasets and models used are open.
3. **Extensive Experimental Validation**: The effectiveness of VividMed is validated through experiments on various downstream tasks. The results show that integrating visual grounding capabilities can improve performance on other tasks.

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Visual–language Foundation Models in Medicine

A Survey of Medical Vision-and-Language Applications and Their Techniques

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Medical Vision-Language Pre-Training for Brain Abnormalities

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation

MOSS-MED: Medical Multimodal Model Serving Medical Image Analysis

MG-3D: Multi-Grained Knowledge-Enhanced 3D Medical Vision-Language Pre-training

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

MedDr: Diagnosis-Guided Bootstrapping for Large-Scale Medical Vision-Language Learning

Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue