VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen
2024-10-16
Abstract: Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses several key challenges in applying existing Vision Language Models (VLMs) to the medical field:

1. **Single Visual Grounding Method**: Most existing VLMs rely on a single method of visual grounding, whereas complex medical tasks require more versatile approaches.
2. **Support for Only 2D Images**: Most existing VLMs process only 2D images, but a large portion of medical images are 3D.
3. **Data Scarcity**: The lack of medical data further compounds these challenges.

To tackle these challenges, the authors propose VividMed, a medical vision language model with versatile visual grounding capabilities. VividMed can generate both semantic segmentation masks and instance-level bounding boxes, and it supports multiple imaging modalities, including 2D and 3D data. Beyond visual grounding, VividMed also performs well on other common downstream tasks such as Visual Question Answering (VQA) and report generation.

### Main Contributions

1. **Proposing VividMed**: An exploratory attempt to endow medical VLMs with versatile visual grounding capabilities, enabling report generation and other tasks that build on visual grounding.
2. **Designing a Three-Stage Training Procedure and a Data Synthesis Pipeline**: The authors design a three-stage training procedure and address data scarcity with an automatic data synthesis pipeline. All datasets and models used are open.
3. **Extensive Experimental Validation**: The effectiveness of VividMed is validated through experiments on various downstream tasks. Ablation results show that integrating visual grounding ability improves performance on the other tasks.
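
The two grounding output types mentioned above, segmentation masks and bounding boxes, can be made concrete with a small sketch. The code below is not taken from VividMed; it is a minimal NumPy illustration, with a hypothetical helper `mask_to_bbox`, of how a binary mask and an axis-aligned bounding box relate, and of how the same representation extends from 2D images to 3D volumes.

```python
import numpy as np


def mask_to_bbox(mask):
    """Return (min_corner, max_corner) of the nonzero region of an N-D mask.

    Hypothetical helper for illustration only; corners are inclusive indices.
    Returns None when the mask is empty.
    """
    coords = np.argwhere(mask)  # shape: (num_nonzero_voxels, ndim)
    if coords.size == 0:
        return None
    lo = tuple(int(v) for v in coords.min(axis=0))
    hi = tuple(int(v) for v in coords.max(axis=0))
    return lo, hi


# A toy 2D mask (e.g. one slice) with a rectangular foreground region.
mask_2d = np.zeros((8, 8), dtype=bool)
mask_2d[2:5, 3:7] = True

# A toy 3D mask (e.g. a small volume) with a box-shaped foreground region.
mask_3d = np.zeros((4, 8, 8), dtype=bool)
mask_3d[1:3, 2:5, 3:7] = True

bbox_2d = mask_to_bbox(mask_2d)
bbox_3d = mask_to_bbox(mask_3d)
```

Because the helper works on index coordinates of any dimensionality, the same function handles both the 2D and the 3D case. A real model would predict masks and boxes directly rather than derive one from the other.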