Abstract: Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information for the LLMs to improve the final results is an urgent open problem. Additionally, the use of visual Transformer models brings high computational complexity. To address these issues, this paper proposes a novel context-guided efficient X-ray medical report generation framework. Specifically, we introduce Mamba as the vision backbone with linear complexity, and the performance obtained is comparable to that of the strong Transformer model. More importantly, we perform context retrieval from the training set for samples within each mini-batch during the training phase, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to the LLM to generate high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) validate the effectiveness of our proposed model. The source code of this work will be released at https://github.com/Event-AHU/Medical_Image_Analysis.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses two main issues in generating high-quality medical reports from X-ray images:

1. **How to extract more effective information to improve the performance of large language models (LLMs)**: Current methods typically use a Transformer to extract visual features from a given X-ray image and then feed these features into an LLM for text generation. How to extract more effective information from the image, so that the LLM generates higher-quality medical reports, remains an urgent open problem.
2. **High computational complexity**: Visual Transformer models can extract high-quality visual features, but they are computationally expensive. With long sequences of visual tokens (e.g., from high-resolution X-ray images), the cost of self-attention grows quadratically with the number of tokens, which hurts both speed and memory usage.

To address these issues, the paper proposes a new context-guided efficient X-ray medical report generation framework (R2GenCSR). Specifically, the framework uses Mamba as the visual backbone network and retrieves context samples from the training set during the training phase, using positively and negatively related samples to enhance feature representation and discriminative learning. Finally, the visual tokens, context information, and prompt statements are fed into the LLM to generate high-quality medical reports.

### Main Contributions

1. **Proposed a new X-ray medical report generation framework (R2GenCSR) based on large language models**, which enhances the training phase by retrieving context samples.
2. **Proposed an efficient and effective half-precision visual Mamba network**, whose performance is comparable to widely used Transformer backbone networks.
3. **Conducted extensive experiments on the widely used IU-Xray, MIMIC-CXR, and CheXpert Plus datasets**, validating the effectiveness of the proposed X-ray report generation framework.
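
The context-retrieval step described above can be illustrated with a small sketch. The summary does not say how the paper decides which training samples count as positively or negatively related, so the code below makes an assumption: it ranks a bank of training-set feature vectors by cosine similarity to the query sample, treats the most similar entries as positive context, and treats the least similar as negative context. The names `cosine_similarity` and `retrieve_context` are hypothetical, and plain Python lists stand in for the visual features a backbone would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_context(query, bank, k=1, exclude=None):
    """Pick k positive and k negative context samples for one query.

    ASSUMPTION (not from the paper): "positive" means most similar by cosine
    similarity and "negative" means least similar. `bank` is a list of
    training-set feature vectors; `exclude` is the bank index of the query
    itself, if the query comes from the training set.
    """
    scored = sorted(
        ((cosine_similarity(query, feat), idx)
         for idx, feat in enumerate(bank) if idx != exclude),
        reverse=True,
    )
    if 2 * k > len(scored):
        raise ValueError("bank too small to draw disjoint positives and negatives")
    positives = [idx for _, idx in scored[:k]]   # highest similarity first
    negatives = [idx for _, idx in scored[-k:]]  # lowest similarity last
    return positives, negatives


# Toy feature bank standing in for training-set visual features.
toy_bank = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.05]
pos_idx, neg_idx = retrieve_context(toy_query, toy_bank, k=1)
print("positive context indices:", pos_idx)
print("negative context indices:", neg_idx)
```

In a training loop, a call like this would run once per sample in the mini-batch, and the retrieved context features would be passed to the LLM together with the vision tokens and the prompt. How the actual framework encodes and combines them is not described in this summary.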

R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

Automatic Report Generation Method Based on Multiscale Feature Extraction and Word Attention Network

R2Gen-Mamba: A Selective State Space Model for Radiology Report Generation

Resource-Efficient Medical Report Generation using Large Language Models

Large Language Model with Region-guided Referring and Grounding for CT Report Generation

TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Model

CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

MedXChat: A Unified Multimodal Large Language Model Framework towards CXRs Understanding and Generation

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

A Comparison of Maternal Interview and Medical Record Ascertainment of Violence among Women who had Poor Pregnancy Outcomes

KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models

Dia-LLaMA: Towards Large Language Model-driven CT Report Generation

Learning to Generate Radiology Findings from Impressions Based on Large Language Model

A Medical Semantic-Assisted Transformer for Radiographic Report Generation

Clinical Context-aware Radiology Report Generation from Medical Images using Transformers

Automatic Medical Report Generation Based on Cross-View Attention and Visual-Semantic Long Short Term Memories

Automated Radiographic Report Generation Purely on Transformer: A Multicriteria Supervised Approach

Generating Radiology Reports via Memory-driven Transformer

Radiology Report Generation for Rare Diseases via Few-shot Transformer

A label information fused medical image report generation framework

Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports