Abstract: Automatic radiology report generation is attracting growing attention because of its large application potential in the healthcare industry. However, existing computer vision and natural language processing approaches to this problem are limited in two respects. First, when extracting image features, most of them neglect multi-view visual reasoning and model only a single-view structure of medical images, such as the space view or the channel view, whereas clinicians rely on multi-view imaging information to reach a comprehensive judgment in daily clinical diagnosis. Second, when generating reports, they overlook context reasoning over multi-modal information and focus on purely textual optimization using retrieval-based methods. We aim to address both issues by proposing a model that better simulates clinicians' perspectives and generates more accurate reports. To address the limitation in feature extraction, we propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception. GIA learns three types of vision perception: depth view, space view, and pixel view. To address the limitation in report generation, we explore how to use multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content when predicting the next word. Specifically, we design a Visual Knowledge-guided Decoder (VKGD), which adaptively determines how much the model should rely on visual information and on previously predicted text when predicting the next word. Our final Intensive Vision-guided Network (IVGN) framework thus consists of a GIA-guided Visual Encoder and the VKGD. Experiments on two commonly used datasets, IU X-Ray and MIMIC-CXR, demonstrate that our method outperforms other state-of-the-art approaches.
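
The adaptive weighting attributed to the VKGD in the abstract can be illustrated with a generic gating sketch. This is not the authors' implementation: the abstract does not specify the mechanism, so the sigmoid gate, the function name `gated_fusion`, the parameters `w_gate` and `b_gate`, and the feature dimension are all assumptions chosen only to show how a model might balance visual context against previously predicted text.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(visual_ctx, text_state, w_gate, b_gate):
    """Blend a visual context vector and a text state vector (hypothetical sketch).

    The gate is computed from both inputs, so the mix can change at every
    decoding step. Gate values near 1 lean on the visual context; values
    near 0 lean on the previously predicted text.
    """
    joint = np.concatenate([visual_ctx, text_state])
    gate = sigmoid(joint @ w_gate + b_gate)  # shape (d,), each entry in (0, 1)
    fused = gate * visual_ctx + (1.0 - gate) * text_state
    return fused, gate


# Toy inputs; in a real decoder these would come from the image encoder
# and from the decoder state over previously generated words.
rng = np.random.default_rng(0)
d = 8  # assumed feature dimension
visual_ctx = rng.normal(size=d)
text_state = rng.normal(size=d)
w_gate = rng.normal(scale=0.1, size=(2 * d, d))  # stands in for learned weights
b_gate = np.zeros(d)

fused, gate = gated_fusion(visual_ctx, text_state, w_gate, b_gate)
```

Because the gate is a per-dimension convex weight, each entry of `fused` lies between the corresponding entries of `visual_ctx` and `text_state`, which makes the trade-off between the two modalities easy to inspect.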

Visual-Textual Cross-Modal Interaction Network for Radiology Report Generation

VMEKNet: Visual Memory and External Knowledge Based Network for Medical Report Generation

Automatic Report Generation Method Based on Multiscale Feature Extraction and Word Attention Network

An Inclusive Task-Aware Framework for Radiology Report Generation

Intensive Vision-guided Network for Radiology Report Generation

Visual-Linguistic Causal Intervention for Radiology Report Generation

Visual prior-based cross-modal alignment network for radiology report generation

Cross-modal Prototype Driven Network for Radiology Report Generation

Reinforced visual interaction fusion radiology report generation

Bridging the Gap: Cross-modal Knowledge Driven Network for Radiology Report Generation

Interactive dual-stream contrastive learning for radiology report generation

Multifocal region-assisted cross-modality learning for chest X-ray report generation

Eye Gaze Guided Cross-Modal Alignment Network for Radiology Report Generation

Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays

MATNet: Exploiting Multi-Modal Features for Radiology Report Generation

Memory-based Cross-modal Semantic Alignment Network for Radiology Report Generation

A medical report generation method integrating teacher–student model and encoder–decoder network

Cross-Modal Causal Intervention for Medical Report Generation

Generating radiology reports via auxiliary signal guidance and a memory-driven network

Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation

Cross-modal Contrastive Attention Model for Medical Report Generation