MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis

Junyu Li, Han Huang, Dong Ni, Wufeng Xue, Dongmei Zhu, Jun Cheng
2023-07-15
Abstract: Early diagnosis of renal cancer can greatly improve the survival rate of patients. Contrast-enhanced ultrasound (CEUS) is a cost-effective and non-invasive imaging technique and has become more and more frequently used for renal tumor diagnosis. However, the classification of benign and malignant renal tumors can still be very challenging due to the highly heterogeneous appearance of cancer and imaging artifacts. Our aim is to detect and classify renal tumors by integrating B-mode and CEUS-mode ultrasound videos. To this end, we propose a novel multi-modal ultrasound video fusion network that can effectively perform multi-modal feature fusion and video classification for renal tumor diagnosis. The attention-based multi-modal fusion module uses cross-attention and self-attention to extract modality-invariant features and modality-specific features in parallel. In addition, we design an object-level temporal aggregation (OTA) module that can automatically filter low-quality features and efficiently integrate temporal information from multiple frames to improve the accuracy of tumor diagnosis. Experimental results on a multicenter dataset show that the proposed framework outperforms the single-modal models and the competing methods. Furthermore, our OTA module achieves higher classification accuracy than the frame-level predictions. Our code is available at https://github.com/JeunyuLi/MUAF.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the early diagnosis of kidney cancer. It proposes a multi-modal ultrasound video fusion network (MUVF-YOLOX) that detects kidney tumors and classifies them as benign or malignant. Its main points are as follows:

1. **Multimodal Information Fusion**: Integrating B-mode and CEUS-mode ultrasound videos improves the accuracy of kidney tumor diagnosis. The proposed attention-based multi-modal fusion (AMF) module uses cross-attention and self-attention in parallel to extract modality-invariant and modality-specific features.
2. **Temporal Information Aggregation**: An object-level temporal aggregation (OTA) module automatically filters out low-quality features along the temporal dimension of the video and efficiently fuses information from multiple frames to improve the accuracy of tumor diagnosis.
3. **Dataset Construction**: The first multimodal ultrasound video dataset containing both B-mode and CEUS-mode videos is established for research on kidney tumor diagnosis.

Experimental results on a multicenter dataset show that the proposed framework outperforms single-modal models and competing methods, which suggests good generalization across centers. The OTA module also achieves higher classification accuracy than frame-level predictions.
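The temporal aggregation idea can be illustrated with a small sketch. This is not the authors' OTA module, whose internals are not given here. It is a simplified stand-in that assumes each frame yields one feature vector and one scalar quality score for the tracked tumor, drops frames below a threshold, and combines the rest with softmax weights. The function name `aggregate_frames`, the `min_quality` parameter, and the fallback to the best frame are all hypothetical choices made for illustration.

```python
import math


def aggregate_frames(frame_features, quality_scores, min_quality=0.5):
    """Fuse per-frame feature vectors of one object into a single vector.

    Hypothetical simplification: frames scoring below `min_quality` are
    dropped, and the remaining frames are averaged with softmax weights
    computed from their quality scores.
    """
    if not frame_features:
        raise ValueError("at least one frame is required")
    if len(frame_features) != len(quality_scores):
        raise ValueError("one quality score is required per frame")

    # Step 1: filter out low-quality frames.
    kept = [(f, s) for f, s in zip(frame_features, quality_scores)
            if s >= min_quality]
    if not kept:
        # Assumed fallback: keep only the single best frame.
        best = max(range(len(quality_scores)), key=quality_scores.__getitem__)
        kept = [(frame_features[best], quality_scores[best])]

    # Step 2: numerically stable softmax over the surviving scores.
    top = max(s for _, s in kept)
    exps = [math.exp(s - top) for _, s in kept]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Step 3: weighted sum of the surviving feature vectors.
    dim = len(kept[0][0])
    return [sum(w * f[i] for w, (f, _) in zip(weights, kept))
            for i in range(dim)]


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
    scores = [0.9, 0.9, 0.1]  # the third frame is treated as low quality
    print(aggregate_frames(features, scores))  # prints [0.5, 0.5]
```

A video-level prediction would then come from a classifier applied to the fused vector rather than from voting over per-frame predictions. The sketch does not model that classifier.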