MMIST-ccRCC: A Real World Medical Dataset for the Development of Multi-Modal Systems

Tiago Mota, M. Rita Verdelho, Alceu Bissoto, Carlos Santiago, Catarina Barata
2024-05-03
Abstract: The acquisition of different data modalities can enhance our knowledge and understanding of various diseases, paving the way for more personalized healthcare. Thus, medicine is progressively moving towards the generation of massive amounts of multi-modal data (e.g., molecular, radiology, and histopathology). While this may seem like an ideal environment in which to capitalize on data-centric machine learning approaches, most methods still focus on exploring a single modality or a pair of modalities, for a variety of reasons: i) lack of ready-to-use curated datasets; ii) difficulty in identifying the best multi-modal fusion strategy; and iii) missing modalities across patients. In this paper we introduce a real-world multi-modal dataset called MMIST-ccRCC that comprises 2 radiology modalities (CT and MRI), histopathology, genomics, and clinical data from 618 patients with clear cell renal cell carcinoma (ccRCC). We provide single- and multi-modal (early and late fusion) benchmarks for the task of 12-month survival prediction in the challenging scenario of one or more missing modalities for each patient, with missing rates that range from 26% for genomics data to more than 90% for MRI. We show that even with such severe missing rates, the fusion of modalities leads to improvements in survival forecasting. Additionally, incorporating a strategy to generate the latent representations of the missing modalities given the available ones further improves performance, highlighting a potential complementarity across modalities. Our dataset and code are available here:
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three obstacles to using multi-modal data in medicine: the lack of ready-to-use curated datasets, the difficulty of identifying the best fusion strategy, and modalities that are missing across patients. To tackle these, it introduces a real-world multi-modal dataset called MMIST-ccRCC, consisting of radiology (CT and MRI), histopathology, genomics, and clinical data from 618 patients with clear cell renal cell carcinoma (ccRCC), a type of kidney cancer. Missing rates are severe, ranging from 26% for genomics to more than 90% for MRI, yet the study finds that fusing modalities improves 12-month survival prediction, and that generating latent representations of the missing modalities from the available ones improves performance further. The paper also provides single-modal and multi-modal (early and late fusion) benchmarks to facilitate research on multi-modal learning with missing data.
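To make the idea of late fusion under missing modalities concrete, the following is a minimal sketch and not the authors' implementation. It assumes each modality already has its own model producing a 12-month survival probability, that a missing modality is marked with NaN, and that fusion is a plain average over whichever modalities are available. The function name `late_fusion` and the column order in `MODALITIES` are hypothetical.

```python
import numpy as np

# Hypothetical column order; the dataset's actual layout may differ.
MODALITIES = ["ct", "mri", "histopathology", "genomics", "clinical"]


def late_fusion(probs):
    """Average per-modality survival probabilities, skipping missing ones.

    probs: array-like of shape (n_patients, n_modalities). NaN marks a
    modality that is missing for that patient (an assumed convention).
    Returns one fused probability per patient, or NaN for a patient with
    no available modality.
    """
    probs = np.asarray(probs, dtype=float)
    available = ~np.isnan(probs)                 # mask of observed modalities
    counts = available.sum(axis=1)               # modalities available per patient
    totals = np.where(available, probs, 0.0).sum(axis=1)
    fused = np.full(probs.shape[0], np.nan)
    # Divide only where at least one modality is available.
    np.divide(totals, counts, out=fused, where=counts > 0)
    return fused


nan = float("nan")
example = [
    [0.8, nan, 0.6, nan, 0.7],   # CT, histopathology, and clinical available
    [nan, nan, nan, 0.4, 0.2],   # only genomics and clinical available
    [nan, nan, nan, nan, nan],   # no modality available
]
fused = late_fusion(example)
```

Averaging only over the available modalities means a patient is never dropped because one scan is missing, which matters when a modality such as MRI is absent for most patients. The paper additionally reports gains from generating latent representations of the missing modalities, which this sketch does not attempt.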