CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting

Naman Sharma
2024-07-12
Abstract: Large vision-language models have recently shown potential for interpreting complex images and generating natural language descriptions using advanced reasoning. Medicine is inherently multimodal: clinicians combine scans and text-based medical histories to write reports, so the field is well placed to benefit from these advances in AI capabilities. We evaluate the publicly available, state-of-the-art foundational vision-language models for chest X-ray interpretation across several datasets and benchmarks. We use linear probes to evaluate the performance of various components, including CheXagent's vision transformer and Q-former, which outperform the industry-standard Torch X-ray Vision models across many different datasets, showing robust generalisation capabilities. Importantly, we find that vision-language models often hallucinate with confident language, which slows down clinical interpretation. Based on these findings, we develop an agent-based vision-language approach for report generation that uses CheXagent's linear probes and BioViL-T's phrase grounding tools to generate uncertainty-aware radiology reports, with pathologies localised and described based on their likelihood. We thoroughly evaluate our vision-language agents using NLP metrics, chest X-ray benchmarks and clinical evaluations, developing an evaluation platform to perform a user study with respiratory specialists. Our results show considerable improvements in the accuracy, interpretability and safety of the AI-generated reports. We stress the importance of analysing results for normal and abnormal scans separately. Finally, we emphasise the need for larger paired (scan and report) datasets alongside data augmentation to tackle the overfitting seen in these large vision-language models.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper attempts to address is the significant increase in chest X-ray examinations in the UK National Health Service (NHS) due to population ageing, which has led to a large backlog of scans awaiting reporting. Specifically, the goals of the paper are:

1. **Understanding and evaluating state-of-the-art (SOTA) large vision-language models (VLMs) for chest X-ray (CXR) interpretation**: The paper evaluates how well existing state-of-the-art vision-language models interpret chest X-ray images across multiple datasets and benchmarks.
2. **Collaborating with clinical experts to understand what prevents these VLMs from entering clinical use**: By working with medical professionals, the paper identifies the problems current technologies face in practice, such as model uncertainty and hallucination.
3. **Improving the existing state of the art in static chest X-ray interpretation (i.e., images without comparison to a prior scan) under data and computational resource constraints**: The paper pays special attention to clinical interpretability and develops methods that generate uncertainty-aware radiology reports, in order to reduce clinicians' workload and improve diagnostic accuracy.

Through these goals, the paper aims to use advanced vision-language models to improve the automatic interpretation of chest X-ray images and the generation of reports, thereby relieving pressure on the healthcare system and improving the efficiency of medical services.
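
To make the idea of describing pathologies "based on their likelihood" concrete, the following is a minimal sketch of how per-pathology probabilities (for example, outputs of linear probes) might be turned into hedged report sentences. It is not the paper's implementation: the function names `describe_findings` and `draft_report`, the likelihood thresholds, and the sentence templates are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical likelihood bands, checked from most to least certain.
# The thresholds and wording are illustrative assumptions, not values
# taken from the paper.
BANDS = [
    (0.85, "There is {finding}."),
    (0.60, "There is likely {finding}."),
    (0.40, "There is possibly {finding}."),
]


def describe_findings(probs: Dict[str, float]) -> List[str]:
    """Turn per-pathology probabilities into hedged sentences.

    Findings are listed from most to least likely; findings below the
    lowest band are omitted.
    """
    sentences = []
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for finding, p in ranked:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {finding!r} out of range: {p}")
        for threshold, template in BANDS:
            if p >= threshold:
                sentences.append(template.format(finding=finding.lower()))
                break
    return sentences


def draft_report(probs: Dict[str, float]) -> str:
    """Join the hedged sentences into a short findings paragraph."""
    sentences = describe_findings(probs)
    if not sentences:
        return "No findings exceeded the reporting threshold."
    return " ".join(sentences)


if __name__ == "__main__":
    example = {"Pleural effusion": 0.91, "Cardiomegaly": 0.65, "Pneumothorax": 0.05}
    print(draft_report(example))
```

In a sketch like this, the wording of each sentence carries the model's confidence, so a reader can tell a confident finding from a tentative one rather than seeing every statement phrased with equal certainty.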