Abstract: In the rapidly evolving landscape of medical imaging, the integration of artificial intelligence (AI) with clinical expertise offers unprecedented opportunities to enhance diagnostic precision and accuracy. Yet the "black box" nature of AI models often limits their integration into clinical practice, where transparency and interpretability are important. This paper presents a novel system that leverages a Large Multimodal Model (LMM) to bridge the gap between AI predictions and the cognitive processes of radiologists. The system consists of two core modules, Temporally Grounded Intention Detection (TGID) and Region Extraction (RE). The TGID module predicts the radiologist's intentions by analyzing eye gaze fixation heatmap videos and the corresponding radiology reports. The RE module then extracts regions of interest that align with these intentions, mirroring the radiologist's diagnostic focus. This approach introduces a new task, radiologist intention detection, and is the first application of Dense Video Captioning (DVC) in the medical domain. By making AI systems more interpretable and better aligned with radiologists' cognitive processes, the proposed system aims to enhance trust, improve diagnostic accuracy, and support medical education. It also holds potential for automated error correction, for guiding junior radiologists, and for fostering more effective training and feedback mechanisms. This work sets a precedent for future research in AI-driven healthcare, offering a pathway towards transparent, trustworthy, and human-centered AI systems. We evaluated the model using Natural Language Generation (NLG), time-related, and vision-based metrics, demonstrating superior performance in generating temporally grounded intentions on the REFLACX and EGD-CXR datasets. The model also demonstrated strong predictive accuracy in overlap scores for medical abnormalities and effective region extraction with high Intersection over Union (IoU), especially in complex cases such as cardiomegaly and edema. These results highlight the system's potential to enhance diagnostic accuracy and support continuous learning in radiology. We are also releasing the source code for our project.

Graphical abstract: Overview of our proposed system, comprising two key submodules: Temporally Grounded Intention Detection (TGID) and Region Extraction (RE). The system processes eye gaze fixation video overlaid on CXR images alongside the corresponding radiology report, ultimately identifying the intended diagnosis and highlighting the associated Regions of Interest (ROI).
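
For readers unfamiliar with the Intersection over Union (IoU) score used to assess region extraction, the following is a minimal sketch of how it is commonly computed for two axis-aligned bounding boxes. The box format `(x_min, y_min, x_max, y_max)` and the function name `box_iou` are assumptions made for illustration; the abstract does not say whether the authors score boxes or pixel masks, so this should not be read as their evaluation code.

```python
from typing import Tuple

# Assumed box format (not specified in the abstract): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, truth: Box) -> float:
    """Return the Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if one exists
    ix_min = max(pred[0], truth[0])
    iy_min = max(pred[1], truth[1])
    ix_max = min(pred[2], truth[2])
    iy_max = min(pred[3], truth[3])

    # Clamp at zero so disjoint boxes contribute no intersection area
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - intersection

    # Guard against degenerate (zero-area) inputs
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical predicted and reference regions, in pixel coordinates
    predicted_roi = (0.0, 0.0, 2.0, 2.0)
    reference_roi = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {box_iou(predicted_roi, reference_roi):.3f}")
```

A score of 1 means the predicted region matches the reference exactly, and a score of 0 means the two do not overlap at all.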

ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models

Bridging Human and Machine Intelligence: Reverse-Engineering Radiologist Intentions for Clinical Trust and Adoption

A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation

CXR-Agent: Vision-language models for chest X-ray interpretation with uncertainty aware radiology reporting

SyntheX: Scaling Up Learning-based X-ray Image Analysis Through In Silico Experiments

I-AI: A Controllable & Interpretable AI System for Decoding Radiologists' Intense Focus for Accurate CXR Diagnoses

Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

AI Accelerated Human-in-the-loop Structuring of Radiology Reports

EVA-X: A Foundation Model for General Chest X-ray Analysis with Self-supervised Learning

RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

X-TRA: Improving Chest X-ray Tasks with Cross-Modal Retrieval Augmentation

Development and Multicenter Validation of Chest X-ray Radiography Interpretations Based on Natural Language Processing.

Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning