Multi-modal Fusion with Semantic Supervision for Radiology Report Generation

Xing Jia, Yun Xiong, Yao Zhang, Li Luo
2023-01-01
Abstract: Radiology report generation, one approach to analyzing radiology images, aims to automatically generate a textual report for a given image. It is of great significance for assisting diagnosis and alleviating the workload of radiologists. Several report generation methods have therefore been proposed. However, these methods suffer from low-quality generation because of visual and textual bias and because they are trained with a text-similarity-oriented objective. To solve this problem, we propose a novel radiology report generation model with multi-modal fusion and semantic supervision, namely MS-Gen. MS-Gen consists of two main components: the semantic-visual fusion module and the semantic weighted contrastive loss. Specifically, the semantic-visual fusion module exploits the domain-specific prior knowledge contained in a large pre-trained visual-language model, as well as the complementary nature of the image and text modalities. Moreover, a novel optimization term, the semantic weighted contrastive loss, is proposed to guide the optimization process with a semantic similarity objective and to further push the generated reports toward higher clinical accuracy. Extensive experiments on two real-world datasets, IU X-Ray and MIMIC-CXR, demonstrate the effectiveness of the proposed model.