Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

Sucheng Ren,Xiaoke Huang,Xianhang Li,Junfei Xiao,Jieru Mei,Zeyu Wang,Alan Yuille,Yuyin Zhou

2024-06-09

Abstract: This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks -- such as cross-modal synthesis, image segmentation, denoising, and inpainting -- within a unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treating these tasks as an image generation process conditioned on prompt image-label pairs and input images, this approach enables a flexible unification of various tasks, even those spanning different modalities and datasets. To capitalize on both local and global context, we design a hybrid method combining masked image modeling with autoregressive training for conditional image generation. This hybrid approach yields the most robust performance across all involved medical imaging tasks. To rigorously evaluate MVG's capabilities, we curated the first comprehensive generalist medical vision benchmark, comprising 13 datasets and spanning four imaging modalities (CT, MRI, X-ray, and micro-ultrasound). Our results consistently establish MVG's superior performance, outperforming existing vision generalists, such as Painter and LVM. Furthermore, MVG exhibits strong scalability, with its performance demonstrably improving when trained on a more diverse set of tasks, and can be effectively adapted to unseen datasets with only minimal task-specific samples. The code is available at https://github.com/OliverRensu/MVG.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper proposes the Medical Vision Generalist (MVG), which it presents as the first foundation model capable of handling various medical imaging tasks, including cross-modal synthesis, image segmentation, denoising, and inpainting. MVG adopts an in-context generation strategy that standardizes inputs and outputs as images, so that different tasks, even those spanning different modalities and datasets, can be handled uniformly. Each task is cast as conditional image-to-image generation: the output is generated from a prompt image-label pair, which specifies the task, together with the input image. To exploit both local and global context, the authors design a hybrid approach that combines masked image modeling with autoregressive training.

To unify the input/output space for segmentation, MVG uses three colorization methods (binary, predefined, and random colorization). This avoids reliance on specific label values and encourages the model to learn from the context rather than from color cues.

For evaluation, the authors curated a comprehensive benchmark for medical imaging comprising 13 datasets that cover four imaging modalities (CT, MRI, X-ray, and micro-ultrasound) and multiple anatomical regions. MVG outperforms existing vision generalists such as Painter and LVM on these tasks, improves as it is trained on a more diverse set of tasks, and adapts to unseen datasets with only a few task-specific samples. In summary, the paper addresses the problem of building a single generalist medical imaging model that performs a range of medical image analysis tasks without separate training for each task or dataset.
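The three colorization methods mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function names `make_palette` and `apply_palette`, the `PREDEFINED` color table, and the choice of black for background are all assumptions made for illustration, and the sketch uses plain Python lists instead of image tensors to stay self-contained.

```python
import random

# Hypothetical fixed palette for the "predefined" scheme (class id -> RGB).
PREDEFINED = {1: (255, 0, 0), 2: (0, 255, 0), 3: (0, 0, 255)}
BACKGROUND = (0, 0, 0)  # assumption: label 0 / unknown labels are drawn black


def make_palette(class_ids, scheme, rng=None):
    """Return a {class_id: (r, g, b)} mapping for one colorization scheme."""
    if scheme == "binary":
        # Every foreground class gets the same color (foreground vs. background).
        return {c: (255, 255, 255) for c in class_ids}
    if scheme == "predefined":
        # Each class always gets the same fixed color.
        return {c: PREDEFINED[c] for c in class_ids}
    if scheme == "random":
        # Colors are drawn fresh each call; collisions are possible in this sketch.
        rng = rng if rng is not None else random.Random()
        return {c: tuple(rng.randint(1, 255) for _ in range(3)) for c in class_ids}
    raise ValueError(f"unknown scheme: {scheme}")


def apply_palette(mask, palette):
    """Map a 2D label map (rows of ints) to a 2D grid of RGB tuples."""
    return [[palette.get(label, BACKGROUND) for label in row] for row in mask]


# Usage: draw one random palette per sample and reuse it for the prompt label
# and the target label, so a class's color is only meaningful within the prompt.
prompt_mask = [[0, 1], [2, 2]]
target_mask = [[1, 1], [0, 2]]
palette = make_palette([1, 2], "random", random.Random(0))
prompt_label_img = apply_palette(prompt_mask, palette)
target_label_img = apply_palette(target_mask, palette)
```

Sharing one randomly drawn palette between the prompt label and the target label is one plausible reading of how random colorization pushes the model to rely on the prompt rather than on fixed color cues; the paper and its code repository are the authority on the actual scheme.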

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

A Survey of Medical Vision-and-Language Applications and Their Techniques

A Generalist Learner for Multifaceted Medical Image Interpretation

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Medical Vision-Language Pre-Training for Brain Abnormalities

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

Generative Medical Segmentation

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medical Segmentation

MGA: Medical generalist agent through text-guided knowledge transformation