Fine-Grained Image-Text Alignment in Medical Imaging Enables Cyclic Image-Report Generation

Yixuan Yuan, Wenting Chen, Linlin Shen, Xiang Li
DOI: https://doi.org/10.48550/arXiv.2312.08078
Abstract: Fine-grained vision-language models (VLMs) have been widely used for inter-modality local alignment between predefined fixed patches and textual words. However, in medical analysis, lesions vary in size and position, so fixed patches may represent them incompletely. Moreover, these methods provide explainability through heatmaps that show the general image areas potentially associated with a text rather than specific regions, so their explanations are not explicit or specific enough. To address these issues, we propose a novel Adaptive patch-word Matching (AdaMatch) model that correlates chest X-ray (CXR) image regions with words in medical reports, and we apply it to CXR-report generation to provide explainability for the generation process. AdaMatch exploits the fine-grained relation between adaptive patches and words to explain specific image regions with their corresponding words. To capture abnormal regions of varying sizes and positions, we introduce the Adaptive Patch extraction (AdaPatch) module, which acquires adaptive patches for these regions. To provide explicit explainability for the CXR-report generation task, we propose an AdaMatch-based bidirectional large language model for Cyclic CXR-report generation (AdaMatch-Cyclic). It employs AdaMatch to obtain keywords for CXR images and ‘keypatches’ for medical reports as hints to guide CXR-report generation. Extensive experiments on two publicly available CXR datasets demonstrate the effectiveness of our method and its superior performance over existing methods.
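The patch-word matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes patch and word embeddings have already been projected into a shared space, uses plain cosine similarity as a stand-in for the learned alignment, and the function name `match_patches_and_words`, the toy vectors, and the example words are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_patches_and_words(patch_embs, word_embs, words, top_k=1):
    """Hypothetical helper, not the paper's procedure.

    Returns (keywords, keypatches):
      keywords[i]      -> the top_k words most similar to patch i
      keypatches[word] -> index of the patch most similar to that word
    Assumes the words are unique, since they are used as dict keys.
    """
    # Similarity matrix: rows are patches, columns are words.
    sim = [[cosine(p, w) for w in word_embs] for p in patch_embs]

    keywords = []
    for row in sim:
        ranked = sorted(range(len(words)), key=lambda j: row[j], reverse=True)
        keywords.append([words[j] for j in ranked[:top_k]])

    keypatches = {}
    for j, word in enumerate(words):
        keypatches[word] = max(range(len(patch_embs)), key=lambda i: sim[i][j])
    return keywords, keypatches

# Toy 2-D embeddings, made up for illustration only.
patches = [[1.0, 0.0], [0.0, 1.0]]
word_vecs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
words = ["effusion", "cardiomegaly", "opacity"]

keywords, keypatches = match_patches_and_words(patches, word_vecs, words)
print(keywords)
print(keypatches)
```

In this toy setting, the keywords per patch play the role of textual hints for image-to-report generation, and the keypatch per word plays the role of visual hints for report-to-image generation. How AdaMatch actually learns the alignment and how AdaPatch selects patch locations and sizes is not specified in the abstract and is not modeled here.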
Computer Science, Medicine