Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation

Wenting Chen, Linlin Shen, Jingyang Lin, Jiebo Luo, Xiang Li, Yixuan Yuan
2024-06-04
Abstract: To address these issues, we propose a novel Adaptive patch-word Matching (AdaMatch) model to correlate chest X-ray (CXR) image regions with words in medical reports and apply it to CXR-report generation to provide explainability for the generation process. AdaMatch exploits the fine-grained relation between adaptive patches and words to provide explanations of specific image regions with corresponding words. To capture the abnormal regions of varying sizes and positions, we introduce the Adaptive Patch extraction (AdaPatch) module to acquire the adaptive patches for these regions adaptively. In order to provide explicit explainability for the CXR-report generation task, we propose an AdaMatch-based bidirectional large language model for Cyclic CXR-report generation (AdaMatch-Cyclic). It employs AdaMatch to obtain the keywords for CXR images and 'keypatches' for medical reports as hints to guide CXR-report generation. Extensive experiments on two publicly available CXR datasets demonstrate the effectiveness of our method and its superior performance over existing methods.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses fine-grained image-text alignment in medical image analysis: establishing explicit associations between regions of chest X-ray (CXR) images and the words of their corresponding medical reports. It identifies two main issues with current vision-language models (VLMs) when handling medical images:
1. **Incomplete representation due to fixed patches**: Existing methods typically split the image into predefined, fixed-size patches, so a lesion whose size or location does not fit the grid can be split across patches or blurred into its surroundings.
2. **Lack of interpretability**: The explanations provided by current models (such as heatmaps) only indicate image regions that are potentially related to the text; they do not pinpoint which specific region corresponds to which word.
To address these issues, the authors propose Adaptive patch-word Matching (AdaMatch), which dynamically extracts adaptive image patches to better capture abnormal regions of different sizes and locations, and aligns these patches with the words in medical reports at a fine-grained level through contrastive learning. To make the generation process interpretable, the authors also design AdaMatch-Cyclic, a bidirectional large language model built on AdaMatch for cyclic generation between CXR images and reports, and construct textual and visual codebooks that supply keywords for images and keypatches for reports as hints to guide generation. According to the paper, experiments on two publicly available CXR datasets show that the method outperforms existing methods.
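The patch-word alignment idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes patch and word embeddings are already available as plain lists of floats, uses cosine similarity with max-pooling to score an image-report pair, and applies a generic symmetric InfoNCE-style contrastive loss. All function names (`similarity_matrix`, `best_patch_per_word`, `best_word_per_patch`, `pair_score`, `contrastive_loss`) and the `temperature` value are hypothetical, and the adaptive patch extraction step (AdaPatch) is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(patches, words):
    """sim[i][j] = cosine similarity between patch i and word j."""
    return [[cosine(p, w) for w in words] for p in patches]


def best_patch_per_word(sim):
    """For each word, the index of its most similar patch (a 'keypatch' in this sketch)."""
    n_words = len(sim[0])
    return [max(range(len(sim)), key=lambda i: sim[i][j]) for j in range(n_words)]


def best_word_per_patch(sim):
    """For each patch, the index of its most similar word (a 'keyword' in this sketch)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in sim]


def pair_score(patches, words):
    """Image-report score: average of the best match found for each word and each patch.

    Assumption: max-then-mean pooling; the paper may aggregate differently.
    """
    sim = similarity_matrix(patches, words)
    word_side = sum(max(sim[i][j] for i in range(len(sim)))
                    for j in range(len(words))) / len(words)
    patch_side = sum(max(row) for row in sim) / len(sim)
    return 0.5 * (word_side + patch_side)


def contrastive_loss(images, reports, temperature=0.1):
    """Symmetric InfoNCE-style loss; images[k] and reports[k] are the matched pair."""
    n = len(images)
    logits = [[pair_score(img, rep) / temperature for rep in reports] for img in images]

    def row_nll(row, target):
        # Negative log-softmax of the target entry, computed stably.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        return log_z - row[target]

    image_to_text = sum(row_nll(logits[k], k) for k in range(n)) / n
    text_to_image = sum(row_nll([logits[r][k] for r in range(n)], k)
                        for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)
```

In this sketch the explainability comes from `best_patch_per_word` and `best_word_per_patch`: each word can be traced to one concrete patch and vice versa, rather than to a diffuse heatmap. The loss is low when matched image-report pairs score higher than mismatched pairs in the batch.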