Medical Vision Generalist: Unifying Medical Imaging Tasks in Context

Sucheng Ren, Xiaoke Huang, Xianhang Li, Junfei Xiao, Jieru Mei, Zeyu Wang, Alan Yuille, Yuyin Zhou
2024-06-09
Abstract: This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks -- such as cross-modal synthesis, image segmentation, denoising, and inpainting -- within a unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treating these tasks as an image generation process conditioned on prompt image-label pairs and input images, this approach enables a flexible unification of various tasks, even those spanning different modalities and datasets. To capitalize on both local and global context, we design a hybrid method combining masked image modeling with autoregressive training for conditional image generation. This hybrid approach yields the most robust performance across all involved medical imaging tasks. To rigorously evaluate MVG's capabilities, we curated the first comprehensive generalist medical vision benchmark, comprising 13 datasets and spanning four imaging modalities (CT, MRI, X-ray, and micro-ultrasound). Our results consistently establish MVG's superior performance, outperforming existing vision generalists, such as Painter and LVM. Furthermore, MVG exhibits strong scalability, with its performance demonstrably improving when trained on a more diverse set of tasks, and can be effectively adapted to unseen datasets with only minimal task-specific samples. The code is available at https://github.com/OliverRensu/MVG.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper proposes the Medical Vision Generalist (MVG), which it presents as the first foundation model capable of handling a range of medical imaging tasks, including cross-modal synthesis, image segmentation, denoising, and inpainting. MVG adopts an in-context generation strategy that represents both inputs and outputs as images, so that tasks spanning different modalities and datasets can be processed uniformly. To capture both local and global context, the authors design a hybrid approach that combines masked image modeling with autoregressive training for conditional image generation. For evaluation, they curate a generalist medical vision benchmark comprising 13 datasets that cover four imaging modalities (CT, MRI, X-ray, and micro-ultrasound) and multiple anatomical regions. MVG outperforms existing vision generalists such as Painter and LVM on this benchmark, improves as it is trained on a more diverse set of tasks, and adapts to unseen datasets with only a few task-specific samples. MVG further unifies the input/output space of tasks by using three colorization methods (binary, predefined, and random colorization), which avoids reliance on raw label values and encourages the model to learn from context rather than from color cues. Through conditional image generation, every task is cast as image-to-image generation, where the output is generated from prompt image-label pairs and the input image. In summary, the paper aims to build a universal medical imaging model that can perform various medical image analysis tasks without separate training for each task or dataset.
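To make the in-context formulation more concrete, the following is a minimal sketch of how a segmentation sample might be turned into a single image-to-image problem. It is not taken from the MVG code base: the function names (`sample_colors`, `colorize`, `build_canvas`), the 2x2 canvas layout, and the choice of black as the background color are all assumptions made for illustration. The one idea it tries to capture from the summary above is that the same colors are shared between the prompt label and the target label, so with random colorization the color-to-class mapping can only be read off from the prompt pair.

```python
import numpy as np

def sample_colors(class_ids, mode="random", palette=None, rng=None):
    """Pick one RGB color per class id (hypothetical helper, not MVG's API)."""
    rng = rng if rng is not None else np.random.default_rng()
    colors = {}
    for c in class_ids:
        if mode == "binary":
            colors[c] = (255, 255, 255)      # every foreground class is white
        elif mode == "predefined":
            colors[c] = tuple(palette[c])    # fixed, dataset-level palette
        elif mode == "random":
            # fresh color per sample; values start at 1 to stay off black
            colors[c] = tuple(int(v) for v in rng.integers(1, 256, size=3))
        else:
            raise ValueError(f"unknown colorization mode: {mode}")
    return colors

def colorize(label_map, colors):
    """Render an integer label map (H, W) as an RGB image (H, W, 3)."""
    rgb = np.zeros(label_map.shape + (3,), dtype=np.uint8)  # background = black
    for c, color in colors.items():
        rgb[label_map == c] = color
    return rgb

def to_rgb(gray):
    """Replicate a grayscale image (H, W) across three channels."""
    return np.repeat(gray[..., None], 3, axis=-1).astype(np.uint8)

def build_canvas(prompt_img, prompt_lbl, query_img, query_lbl, colors):
    """Assumed layout: [prompt image | prompt label] over [query image | target]."""
    top = np.concatenate([to_rgb(prompt_img), colorize(prompt_lbl, colors)], axis=1)
    bottom = np.concatenate([to_rgb(query_img), colorize(query_lbl, colors)], axis=1)
    canvas = np.concatenate([top, bottom], axis=0)
    h, w = query_lbl.shape
    target_mask = np.zeros(canvas.shape[:2], dtype=bool)
    target_mask[h:, w:] = True  # bottom-right quadrant is what the model must generate
    return canvas, target_mask

# Toy usage with 4x4 "scans" and a single foreground class.
rng = np.random.default_rng(0)
prompt_img = rng.integers(0, 256, size=(4, 4))
query_img = rng.integers(0, 256, size=(4, 4))
prompt_lbl = np.zeros((4, 4), dtype=int)
prompt_lbl[1:3, 1:3] = 1
query_lbl = np.zeros((4, 4), dtype=int)
query_lbl[0:2, 0:2] = 1

colors = sample_colors([1], mode="random", rng=rng)
canvas, target_mask = build_canvas(prompt_img, prompt_lbl, query_img, query_lbl, colors)
```

In this sketch a model would see the canvas with the masked quadrant hidden and be trained to reconstruct it. Tasks such as denoising or cross-modal synthesis would fit the same layout by placing the clean or target-modality image in the right-hand column instead of a colorized label.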