Visual-Textual Cross-Modal Interaction Network for Radiology Report Generation

Wenfeng Zhang, Baoning Cai, Jianming Hu, Qibing Qin, Kezhen Xie
DOI: https://doi.org/10.1109/lsp.2024.3379005
2024-04-09
IEEE Signal Processing Letters
Abstract: The radiology report generation task generates diagnostic descriptions from radiology images, aiming to ease the workload of radiologists and alert them to abnormalities. However, data bias poses a persistent challenge: abnormal regions usually occupy only a small portion of a radiology image, yet the report generation process should pay greater attention to these regions. Moreover, the available data volume is small relative to the scale of large language models, which makes training difficult. To address these issues, we propose a Visual-textual Cross-modal Interaction Network (VCIN) to enhance the quality of generated reports. VCIN comprises two key modules: Abundant Clinical Information Embedding (ACIE), which gathers rich cross-modal interaction information to promote report generation for abnormal regions; and a BERT-based Decoder-only Generator (BDG), built on the BERT architecture to mitigate training difficulties. Experimental results on two public benchmark datasets demonstrate the superior performance of the proposed model.
engineering, electrical & electronic
What problem does this paper attempt to address?