Ultrasound Report Generation with Cross-Modality Feature Alignment via Unsupervised Guidance

Jun Li, Tongkun Su, Baoliang Zhao, Faqin Lv, Qiong Wang, Nassir Navab, Ying Hu, Zhongliang Jiang
2024-06-02
Abstract: Automatic report generation has emerged as a significant research area in computer-aided diagnosis, aiming to alleviate the burden on clinicians by generating reports automatically from medical images. In this work, we propose a novel framework for automatic ultrasound report generation that combines unsupervised and supervised learning methods to aid the report generation process. Our framework uses unsupervised learning to extract potential knowledge from ultrasound text reports, which serves as prior information to guide the model in aligning visual and textual features, thereby addressing the challenge of feature discrepancy. Additionally, we design a global semantic comparison mechanism to help the model generate more comprehensive and accurate medical reports. To enable ultrasound report generation, we constructed three large-scale ultrasound image-text datasets from different organs for training and validation. Extensive comparisons with other state-of-the-art approaches show that our framework achieves superior performance across all three datasets. Code and dataset are available at this link.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to align visual features with textual features when generating reports from medical ultrasound images, so that the generated reports are accurate and detailed. The paper points out that current medical image report generation methods focus mainly on radiology reports, while ultrasound report generation has received relatively little attention. Ultrasound images pose their own challenges, such as low contrast and the presence of artifacts, which make it difficult to extract the visual features needed for a text description. In addition, ultrasound reports are usually longer and more detailed than radiology reports, which makes text generation more complex.

To address these challenges, the authors propose a framework that combines unsupervised and supervised learning methods to extract latent medical knowledge and use it as prior information that guides the model in aligning visual and textual features. The framework consists of three modules:

1. **Knowledge Distiller (KD)**: Extracts latent prior knowledge from ultrasound reports through an unsupervised learning method, simulating the process by which doctors acquire knowledge from medical records.
2. **Knowledge Matched Visual Extractor (KMVE)**: Uses the prior knowledge extracted by the KD module as pseudo-labels to promote the learning of visual features, thereby bridging the gap between visual and textual features.
3. **Report Generator (RG)**: Generates text reports from the aligned visual features and includes a similarity comparison mechanism to keep the generated reports consistent in length and accuracy.

The main contributions of the paper are:

- Proposing a new framework that extracts latent medical knowledge through unsupervised and supervised learning methods without additional disease labels, thereby reducing the differences between visual and textual features.
- Designing a similarity comparison mechanism that incorporates global semantic information to generate complex sentences, making the generated reports more accurate and detailed.
- Constructing three large-scale ultrasound image-text datasets for the breast, thyroid, and liver respectively, demonstrating the generalization ability of the method.

Through these innovations, the framework can generate high-quality medical reports from ultrasound images of multiple organs, providing an effective auxiliary tool for clinicians.
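To make the Knowledge Distiller idea more concrete, the following is a minimal sketch of how report texts might be clustered without supervision so that each cluster id can serve as a pseudo-label for the paired image. It is not the authors' implementation: the bag-of-words embedding, the small k-means-style loop, the example reports, and the names `embed`, `cosine`, `pseudo_labels`, and `report_similarity` are all assumptions made for illustration. The summary above does not specify which text encoder or clustering algorithm the paper uses. The `report_similarity` helper is a similarly hypothetical stand-in for a global comparison between a generated report and its reference.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Bag-of-words count vector (assumed stand-in for a real sentence encoder)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pseudo_labels(reports, k, iters=10):
    """Cluster reports without labels and return one cluster id per report."""
    vocab = sorted({w for r in reports for w in r.lower().split()})
    vecs = [embed(r, vocab) for r in reports]

    # Deterministic init: start from the first report, then repeatedly add
    # the report that is least similar to every center chosen so far.
    centers = [vecs[0][:]]
    while len(centers) < k:
        far = min(vecs, key=lambda v: max(cosine(v, c) for c in centers))
        centers.append(far[:])

    labels = [0] * len(vecs)
    for _ in range(iters):
        # Assign each report to its most similar center.
        labels = [max(range(k), key=lambda c: cosine(v, centers[c])) for v in vecs]
        # Recompute each center as the mean of its assigned reports.
        for c in range(k):
            members = [v for v, lab in zip(vecs, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def report_similarity(generated, reference):
    """Whole-report cosine similarity (hypothetical global comparison score)."""
    vocab = sorted(set(generated.lower().split()) | set(reference.lower().split()))
    return cosine(embed(generated, vocab), embed(reference, vocab))


# Made-up example reports, not taken from the paper's datasets.
reports = [
    "hypoechoic nodule in left breast with irregular margin",
    "hypoechoic nodule in right breast with irregular margin",
    "thyroid gland normal size with homogeneous echo",
    "thyroid gland normal size with uniform echo",
]
labels = pseudo_labels(reports, k=2)
print(labels)
print(report_similarity(reports[0], reports[1]))
```

In this toy run the two breast reports end up in one cluster and the two thyroid reports in the other. In a KMVE-style setup, those cluster ids would be the classification targets for the visual encoder, which is how text-derived knowledge can guide visual features without disease labels. A real system would likely swap the bag-of-words vectors for a pretrained sentence encoder and use a more robust clustering method.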