Abstract: Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different from natural images, and the language used to succinctly capture relevant details in medical data draws on a narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multimodal models trained on natural image-text pairs tend not to generalize well to the medical domain. Generative imaging models that faithfully represent medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest X-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and have human domain experts evaluate image quality and text-image alignment. We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new extent by using free-form text prompts including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement in a classifier trained jointly on synthetic and real images, and a 3% improvement when it is trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge into the text encoder and can improve its ability to represent certain diseases, such as pneumothorax, by 25%.

A vision–language foundation model for the generation of realistic chest X-ray images

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains

Vision-Language Generative Model for View-Specific Chest X-ray Generation

Exploring Foundation Models for Synthetic Medical Imaging: A Study on Chest X-Rays and Fine-Tuning Techniques

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning

Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

XReal: Realistic Anatomy and Pathology-Aware X-ray Generation via Controllable Diffusion Model

Vision-Language Synthetic Data Enhances Echocardiography Downstream Tasks

Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images

A Critical Assessment of Generative Models for Synthetic Data Augmentation on Limited Pneumonia X-ray Data

Synthetic chest X-ray images from text prompts

CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting

Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

Cascaded Latent Diffusion Models for High-Resolution Chest X-ray Synthesis

Medical Vision-Language Pre-Training for Brain Abnormalities

Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models

Generating Realistic X-ray Scattering Images Using Stable Diffusion and Human-in-the-loop Annotations