CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting

Naman Sharma

2024-07-12

Abstract: Recently, large vision-language models have shown potential in interpreting complex images and generating natural language descriptions using advanced reasoning. Medicine is inherently multimodal, combining scans with text-based medical histories to write reports, so it stands to benefit from these leaps in AI capabilities. We evaluate publicly available, state-of-the-art foundational vision-language models for chest X-ray interpretation across several datasets and benchmarks. We use linear probes to evaluate the performance of various components, including CheXagent's vision transformer and Q-former, which outperform the industry-standard Torch X-ray Vision models across many different datasets, showing robust generalisation capabilities. Importantly, we find that vision-language models often hallucinate with confident language, which slows down clinical interpretation. Based on these findings, we develop an agent-based vision-language approach for report generation that uses CheXagent's linear probes and BioViL-T's phrase grounding tools to generate uncertainty-aware radiology reports, with pathologies localised and described according to their likelihood. We thoroughly evaluate our vision-language agents using NLP metrics, chest X-ray benchmarks and clinical evaluations, developing an evaluation platform to run a user study with respiratory specialists. Our results show considerable improvements in the accuracy, interpretability and safety of the AI-generated reports. We stress the importance of analysing results for normal and abnormal scans separately. Finally, we emphasise the need for larger paired (scan and report) datasets, alongside data augmentation, to tackle the overfitting seen in these large vision-language models.
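The idea of describing pathologies according to their likelihood can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the probability thresholds and the hedging phrases are all hypothetical, and the probe is reduced to a single logistic head applied to an already-computed image embedding.

```python
import math

# Hypothetical thresholds and phrases, chosen only for illustration.
# Ordered from most to least confident; the first matching threshold wins.
LIKELIHOOD_PHRASES = [
    (0.9, "is very likely"),
    (0.7, "is likely"),
    (0.4, "is possible"),
    (0.0, "is unlikely"),
]


def probe_probability(embedding, weights, bias):
    """Apply a linear probe (logistic head) to a frozen image embedding."""
    score = sum(w * x for w, x in zip(weights, embedding)) + bias
    # Numerically stable sigmoid for large positive or negative scores.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def likelihood_phrase(probability):
    """Map a probability in [0, 1] to a hedged phrase."""
    for threshold, phrase in LIKELIHOOD_PHRASES:
        if probability >= threshold:
            return phrase
    return LIKELIHOOD_PHRASES[-1][1]


def describe(pathology, probability):
    """Build one uncertainty-aware report sentence for a pathology."""
    phrase = likelihood_phrase(probability)
    return f"{pathology} {phrase} (probability {probability:.2f})."


if __name__ == "__main__":
    # Toy embedding and probe weights; real values would come from a
    # pretrained encoder and a probe trained on labelled scans.
    p = probe_probability([0.2, -0.1, 0.4], [1.5, 0.3, 2.0], -0.5)
    print(describe("Pleural effusion", p))
```

A lookup table keeps the wording auditable: a clinician can inspect and adjust the thresholds without touching the model, which is one plausible way to avoid the confidently worded hallucinations the abstract describes.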

Image and Video Processing, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the large backlog of chest X-ray scans awaiting reporting in the UK National Health Service (NHS), which has built up as an ageing population drives a significant increase in the number of examinations. Specifically, the goals of the paper are:

1. **Understanding and evaluating state-of-the-art (SOTA) large vision-language models (VLMs) for chest X-ray (CXR) interpretation**: The paper evaluates existing state-of-the-art vision-language models on chest X-ray interpretation across multiple datasets and benchmarks.
2. **Collaborating with clinical experts to understand what keeps these VLMs out of clinical use**: Working with medical professionals, the paper identifies the problems current models face in practice, such as model uncertainty and hallucination.
3. **Improving the state of the art in static chest X-ray interpretation (i.e., images without comparison to prior scans) under data and computational resource constraints**: The paper pays special attention to clinical interpretability and develops methods that generate uncertainty-aware radiology reports, in order to reduce the workload of clinicians and improve diagnostic accuracy.

Through these goals, the paper aims to use advanced vision-language models to improve the automatic interpretation of chest X-ray images and the generation of reports, thereby relieving pressure on the healthcare system and improving the efficiency of medical services.

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

Deep neural models for automated multi-task diagnostic scan management—quality enhancement, view classification and report generation

An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

Collaboration between clinicians and vision–language models in radiology report generation

CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

Vision-Language Generative Model for View-Specific Chest X-ray Generation

A vision–language foundation model for the generation of realistic chest X-ray images

Vispi: Automatic Visual Perception and Interpretation of Chest X-rays

Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation

RaDialog: A Large Vision-Language Model for Radiology Report Generation and Conversational Assistance

Beyond the Hype: A dispassionate look at vision-language models in medical scenario

ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

Automatically Generating Narrative-Style Radiology Reports from Volumetric CT Images; a Proof of Concept