Abstract: Automatic report generation has emerged as a significant research area in computer-aided diagnosis, aiming to alleviate the burden on clinicians by generating reports automatically from medical images. In this work, we propose a novel framework for automatic ultrasound report generation, leveraging a combination of unsupervised and supervised learning methods to aid the report generation process. Our framework incorporates unsupervised learning methods to extract potential knowledge from ultrasound text reports, which serves as prior information to guide the model in aligning visual and textual features, thereby addressing the challenge of feature discrepancy. Additionally, we design a global semantic comparison mechanism to help the model generate more comprehensive and accurate medical reports. To enable ultrasound report generation, we constructed three large-scale ultrasound image-text datasets from different organs for training and validation purposes. Extensive comparisons with other state-of-the-art approaches demonstrate its superior performance across all three datasets. Code and dataset are available at this link.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to effectively align visual features with textual features in the generation of medical ultrasound image reports to produce accurate and detailed medical reports. Specifically, the paper points out that current medical image report generation methods mainly focus on radiology reports, while there is relatively little research on ultrasound report generation. Ultrasound images have unique challenges, such as low contrast and the presence of artifacts, which make it difficult to extract relevant visual features from the images for text description. In addition, ultrasound reports are usually longer and more detailed than radiology reports, which increases the complexity of text generation.

To address these challenges, the authors propose a new framework that combines unsupervised and supervised learning methods, aiming to extract latent medical knowledge and use it as prior information to guide the model to align visual and textual features. The framework consists of three modules:

1. **Knowledge Distiller (KD)**: Extract latent prior knowledge from ultrasound reports through an unsupervised learning method, simulating the process by which doctors acquire knowledge from medical records.
2. **Knowledge Matched Visual Extractor (KMVE)**: Use the prior knowledge extracted from the KD module as pseudo-labels to promote the learning of visual features, thereby bridging the gap between visual and textual features.
3. **Report Generator (RG)**: Generate text reports based on the aligned visual features and design a similarity comparison mechanism to ensure the consistency of the generated reports in terms of length and accuracy.

The main contributions of the paper include:

- Proposing a new framework that extracts latent medical knowledge through unsupervised and supervised learning methods without additional disease labels, thereby reducing the differences between visual and textual features.
- Designing a similarity comparison mechanism that combines global semantic information to generate complex sentences, making the generated reports more accurate and detailed.
- Constructing three large-scale ultrasound image-text datasets for the breast, thyroid, and liver respectively, demonstrating the generalization ability of the method.

Through these innovations, this framework can generate high-quality medical reports on ultrasound images of multiple organs, providing an effective auxiliary tool for clinicians.
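The summary above does not say which unsupervised method the KD module uses or how the similarity comparison is computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a bag-of-words encoder and a small spherical k-means as stand-ins: report texts are clustered to obtain pseudo-labels (the kind of signal KMVE could train on), and a cosine score compares a generated report with a reference at the whole-report level. The example reports and all function names (`embed`, `cosine`, `kmeans_labels`) are hypothetical.

```python
import math
from collections import Counter


def embed(report, vocab):
    """Bag-of-words count vector, L2-normalised (stand-in for a real text encoder)."""
    counts = Counter(report.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def kmeans_labels(vectors, k, iters=10):
    """Tiny spherical k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the normalised mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if not members:
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            norm = math.sqrt(sum(m * m for m in mean)) or 1.0
            centroids[c] = [m / norm for m in mean]
    return labels


# Hypothetical report snippets; real reports are longer and more detailed.
reports = [
    "thyroid nodule hypoechoic with clear margin",
    "liver echo uniform no focal lesion",
    "thyroid nodule hypoechoic with calcification",
    "liver echo coarse no focal lesion",
]
vocab = sorted({w for r in reports for w in r.lower().split()})
vectors = [embed(r, vocab) for r in reports]

# Step 1 (KD-like): cluster report texts; cluster ids act as pseudo-labels.
pseudo_labels = kmeans_labels(vectors, k=2)

# Step 2 (comparison-like): score a generated report against references
# at the whole-report level rather than word by word.
generated = embed("thyroid nodule hypoechoic", vocab)
sim_thyroid = cosine(generated, vectors[0])
sim_liver = cosine(generated, vectors[1])
print(pseudo_labels, round(sim_thyroid, 3), round(sim_liver, 3))
```

In a real system the bag-of-words encoder would be replaced by a learned sentence encoder, and the pseudo-labels would supervise an image classifier so that visual features group the same way the reports do.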

Ultrasound Report Generation with Cross-Modality Feature Alignment via Unsupervised Guidance
