Abstract:
Background: The automatic generation of radiology reports, which seeks to create a free-text description from a clinical radiograph, is emerging as a pivotal intersection between clinical medicine and artificial intelligence. Leveraging natural language processing technologies can accelerate report creation, enhancing health care quality and standardization. However, most existing studies have not yet fully tapped into the combined potential of advanced language and vision models.
Objective: The purpose of this study was to explore the integration of pretrained vision-language models into radiology report generation, enabling the vision-language model to automatically convert clinical images into high-quality textual reports.
Methods: We introduced a radiology report generation model named ClinicalBLIP, which builds upon the foundational InstructBLIP model and refines it using clinical image-to-text data sets. A multistage fine-tuning approach via low-rank adaptation was proposed to deepen the semantic comprehension of the visual encoder and the large language model for clinical imagery. Furthermore, prior knowledge was integrated through prompt learning to enhance the precision of the generated reports. Experiments were conducted on both the IU X-RAY and MIMIC-CXR data sets, with ClinicalBLIP compared to several leading methods.
Results: On the IU X-RAY and MIMIC-CXR test sets, respectively, ClinicalBLIP obtained scores of 0.570 and 0.365 for the Metric for Evaluation of Translation with Explicit Ordering (METEOR) and 0.534 and 0.313 for the Recall-Oriented Understudy for Gisting Evaluation (ROUGE). This performance notably surpasses that of existing state-of-the-art methods. Further evaluations confirmed the effectiveness of the multistage fine-tuning and the integration of prior information, each of which led to substantial improvements.
Conclusions: The proposed ClinicalBLIP model demonstrated robustness and effectiveness in enhancing clinical radiology report generation, suggesting significant promise for real-world clinical applications.
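The abstract names low-rank adaptation as the fine-tuning mechanism but does not describe how it was configured. As a rough illustration of the general technique only, the following minimal sketch shows a single linear layer with a frozen weight matrix and a trainable low-rank update. The class name `LoRALinear`, the `rank` and `alpha` values, and the toy dimensions are all hypothetical and are not taken from ClinicalBLIP.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer y = W x + (alpha / rank) * B (A x).

    W stays frozen; only the small matrices A and B would be trained.
    This is an illustrative sketch, not the ClinicalBLIP implementation.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                      # frozen pretrained weight
        self.scale = alpha / rank                 # scaling of the update
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return (len(self.A) * len(self.A[0])
                + len(self.B) * len(self.B[0]))


# Hypothetical 4x6 "pretrained" weight and a rank-2 adapter.
W = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(W, rank=2, alpha=4)
x = [1.0, 0.5, -0.5, 2.0, 0.0, 1.5]
y = layer.forward(x)
```

In this toy setting the adapter trains 20 values (2×6 for `A` plus 4×2 for `B`) instead of the 24 entries of the full weight matrix; the saving grows quickly as the layer dimensions increase while the rank stays small.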

Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

Improving Medical Multi-modal Contrastive Learning with Expert Annotations

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare

MedicalCLIP: Anomaly-Detection Domain Generalization with Asymmetric Constraints

CLIP in Medical Imaging: A Comprehensive Survey

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training in Radiology

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents

Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning

Visual Prompt Engineering for Medical Vision Language Models in Radiology

Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Enhanced Breast Cancer Diagnosis with Multi-view Mammography

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Anatomical Structure-Guided Medical Vision-Language Pre-training