R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

Xiao Wang, Yuehang Li, Fuling Wang, Shiao Wang, Chuanfu Li, Bo Jiang
2024-08-19
Abstract: Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to improve their final results is an urgent open problem. Additionally, the use of visual Transformer models brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of a strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to the LLM to generate high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) validate the effectiveness of our proposed model. The source code of this work will be released at https://github.com/Event-AHU/Medical_Image_Analysis.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses two main issues in generating high-quality medical reports from X-ray images:

1. **How to extract more effective information to improve the performance of large language models (LLMs)**: Current methods typically use a Transformer to extract visual features from a given X-ray image and then feed these features into an LLM for text generation. How to extract more effective information from X-ray images, so that the LLM generates higher-quality reports, remains an open problem.
2. **High computational complexity**: Visual Transformer models extract high-quality visual features, but at a high computational cost. With long sequences of visual tokens (e.g., from high-resolution X-ray images), self-attention, whose cost grows quadratically with the number of tokens, is slow and memory-intensive.

To address these issues, the paper proposes a new context-guided, efficient X-ray medical report generation framework (R2GenCSR). The framework uses Mamba as the visual backbone network and, during the training phase, retrieves context samples from the training set, using positively and negatively related samples to enhance feature representation and discriminative learning. Finally, the visual tokens, context information, and prompt statements are fed into the LLM to generate high-quality medical reports.

### Main Contributions

1. **Proposed a new X-ray medical report generation framework (R2GenCSR) based on large language models**, which enhances the training phase by retrieving context samples.
2. **Proposed an efficient and effective half-precision visual Mamba network**, whose performance is comparable to widely used Transformer backbone networks.
3. **Conducted extensive experiments on the widely used IU-Xray, MIMIC-CXR, and CheXpert Plus datasets**, validating the effectiveness of the proposed X-ray report generation framework.
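
The context retrieval step described above can be illustrated with a small sketch. The text does not say how "positively" and "negatively" related samples are selected, so this example assumes, purely for illustration, that relatedness is measured by cosine similarity between feature vectors: the most similar training samples are treated as positive context and the least similar as negative context. The names `cosine_similarity` and `retrieve_context`, and the toy feature bank, are hypothetical and are not taken from the paper or its code release.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_context(
    query: Sequence[float],
    bank: Sequence[Sequence[float]],
    k: int = 1,
    exclude: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Return (positive_indices, negative_indices) into `bank`.

    ASSUMPTION: "positive" means the k most similar training features and
    "negative" means the k least similar ones. The source does not specify
    the retrieval criterion; this is an illustrative stand-in.
    `exclude` lets a training sample skip itself when it is the query.
    """
    scored = [
        (cosine_similarity(query, feat), idx)
        for idx, feat in enumerate(bank)
        if idx != exclude
    ]
    if k < 1 or 2 * k > len(scored):
        raise ValueError("k must be >= 1 and at most half the candidate count")
    # Sort by similarity, most similar first (stable for ties).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    positives = [idx for _, idx in scored[:k]]
    negatives = [idx for _, idx in scored[-k:]]
    return positives, negatives


# Toy feature bank standing in for pooled visual features of training images.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

# Retrieve context for training sample 0, excluding the sample itself.
positives, negatives = retrieve_context(bank[0], bank, k=1, exclude=0)
print(positives, negatives)  # prints: [1] [3]
```

In a training loop, a call like this would be made for each sample in the mini-batch, and the retrieved samples' features would then accompany the vision tokens and prompt passed to the LLM. How the retrieved context is actually encoded and combined is not described here, so that part is left out of the sketch.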