Abstract:
Background: The automatic generation of radiology reports, which seeks to create a free-text description from a clinical radiograph, is emerging as a pivotal intersection between clinical medicine and artificial intelligence. Leveraging natural language processing technologies can accelerate report creation, enhancing health care quality and standardization. However, most existing studies have not yet fully tapped into the combined potential of advanced language and vision models.
Objective: The purpose of this study was to explore the integration of pretrained vision-language models into radiology report generation. This would enable the vision-language model to automatically convert clinical images into high-quality textual reports.
Methods: In our research, we introduced a radiology report generation model named ClinicalBLIP, building upon the foundational InstructBLIP model and refining it using clinical image-to-text data sets. A multistage fine-tuning approach via low-rank adaptation was proposed to deepen the semantic comprehension of the visual encoder and the large language model for clinical imagery. Furthermore, prior knowledge was integrated through prompt learning to enhance the precision of the reports generated. Experiments were conducted on both the IU X-RAY and MIMIC-CXR data sets, with ClinicalBLIP compared to several leading methods.
Results: Experimental results revealed that ClinicalBLIP obtained superior scores of 0.570/0.365 and 0.534/0.313 on the IU X-RAY/MIMIC-CXR test sets for the Metric for Evaluation of Translation with Explicit Ordering (METEOR) and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) evaluations, respectively. This performance notably surpasses that of existing state-of-the-art methods. Further evaluations confirmed the effectiveness of the multistage fine-tuning and the integration of prior information, leading to substantial improvements.
Conclusions: The proposed ClinicalBLIP model demonstrated robustness and effectiveness in enhancing clinical radiology report generation, suggesting significant promise for real-world clinical applications.
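
The Methods mention fine-tuning via low-rank adaptation, but the abstract gives no implementation details. As a minimal sketch of the general idea only (not ClinicalBLIP's actual code), the example below keeps a pretrained weight matrix W frozen and adds a trainable low-rank update, giving an effective weight W + (alpha / r) * B A. The layer sizes, the rank, the scaling value alpha, and the helper names `matmul` and `lora_weight` are all illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight after adaptation."""
    r = len(a)                # rank r = number of rows of A
    delta = matmul(b, a)      # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


random.seed(0)
# Hypothetical sizes chosen for illustration; none of these come from the paper.
d_out, d_in, rank, alpha = 6, 8, 2, 4.0

# Frozen pretrained weight (a random stand-in here).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A starts small and random, B starts at zero,
# so the low-rank update is exactly zero before any training step.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

W_eff = lora_weight(W, A, B, alpha)

full_params = d_out * d_in            # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters updated by the low-rank factors
print(full_params, lora_params)
```

Because B is initialized to zero, the adapted layer initially behaves exactly like the pretrained one, and only the two small factors need to be trained. At realistic layer sizes the gap between the two parameter counts is far larger than in this toy setting.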

A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation

CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting

CheXpedition: Investigating Generalization Challenges for Translation of Chest X-Ray Algorithms to the Clinical Setting

Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

ChEX: Interactive Localization and Region Description in Chest X-rays

Vision-Language Generative Model for View-Specific Chest X-ray Generation

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

A vision–language foundation model for the generation of realistic chest X-ray images

ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation

Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback

FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

CheXternal: Generalization of Deep Learning Models for Chest X-ray Interpretation to Photos of Chest X-rays and External Clinical Settings

Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

Clinically Accurate Chest X-Ray Report Generation

An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation