Abstract:

Introduction: Understanding complex diseases such as cancer, and improving their diagnosis, may require combining multiple data modalities, such as histopathological images and omics data (e.g., RNA-seq). By integrating these heterogeneous but complementary data, a multimodal approach could achieve better results than any single modality. The growing availability of large datasets such as The Cancer Genome Atlas (TCGA), with more than 10,000 patients, has made it possible to combine different modalities to train machine learning algorithms, which offers great potential for addressing challenging cancer-related research questions. In this proof-of-concept initiative, we use machine learning approaches within an open-source framework to leverage the potential of multimodality (histopathology whole slide images (WSI) and genomics/RNA-seq) to build predictive AI models for cancer type and prostate Gleason score, and to explore the potential for developing a quality control step.

Method: We used matched WSI and RNA-seq profiles from TCGA, comprising 11,093 samples across 30 cancer types, to develop a pan-cancer classification model using both modalities. For prostate Gleason score prediction, 401 patients were available. Both datasets were split into training (70%) and test (30%) sets. We used a late fusion approach in which we combined the RNA-seq model (linear SVM) with the WSI model (ResNet18) by multiplying the probability scores of the two single-modality models. Model performance was measured with the F1 metric.

Results: For cancer type prediction, the multimodal model achieved an F1 score of 0.95 on the test set. About 40% of the cancer types benefited from a synergistic effect when the two modalities were combined. The cancer types that benefited most, with their percent increase in F1 score, were cervical squamous cell carcinoma and endocervical adenocarcinoma (4.23%), cholangiocarcinoma (6.66%), and uterine carcinosarcoma (4%). Interestingly, for other cancer types, such as rectum adenocarcinoma, sarcoma, or stomach adenocarcinoma, the combination did not improve predictive scores over a single-modality model. For prostate cancer grading (Gleason score prediction of patterns 3/4/5), the combined multimodal model achieved an F1 score of 0.73, outperforming the single-modality models.

Conclusion: By combining histopathology imaging and omics modalities, we demonstrated synergistic effects in predictive power for both cancer-related research questions. Using both modalities improved predictive performance in about 40% of the classified cancer types. Imaging or omics modalities alone can be sufficient in some cases, and their strengths are very problem-specific.

Citation Format: Christian Wohlfart, Eldad Klaiman, Jacub Witkowski, Michael King, Jacob Gildenblat, Ofir Etz-Hadar, Mohammad Ashtari, Antoaneta Vladimirova. Multi-modal machine learning approaches for predicting cancer type and Gleason grade leveraging public TCGA data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 4970.
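
Illustrative sketch: The late fusion step described in the Method (multiplying the probability scores of the two single-modality models) might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `late_fusion`, the toy probability values, and the renormalization step are all hypothetical, and each model is assumed to already output one probability per class for every sample.

```python
import numpy as np


def late_fusion(prob_rna, prob_wsi):
    """Fuse two (n_samples, n_classes) probability arrays by elementwise product.

    Hypothetical helper for illustration only.
    """
    prob_rna = np.asarray(prob_rna, dtype=float)
    prob_wsi = np.asarray(prob_wsi, dtype=float)
    if prob_rna.shape != prob_wsi.shape:
        raise ValueError("both models must score the same samples and classes")

    # Late fusion: multiply the per-class scores of the two models.
    fused = prob_rna * prob_wsi

    # Assumption: renormalize each row so the fused scores sum to 1.
    # This does not change the predicted class; it only keeps the scores
    # readable as probabilities. Assumes no row is all zeros.
    return fused / fused.sum(axis=1, keepdims=True)


# Toy scores for 2 samples and 3 classes (made-up numbers, not TCGA data).
prob_rna = np.array([[0.6, 0.3, 0.1],
                     [0.2, 0.5, 0.3]])
prob_wsi = np.array([[0.5, 0.4, 0.1],
                     [0.1, 0.3, 0.6]])

fused = late_fusion(prob_rna, prob_wsi)
predicted = fused.argmax(axis=1)  # index of the highest fused score per sample
print(predicted)
```

In the second toy sample the two models disagree (the RNA-seq scores favor class 1, the WSI scores favor class 2), and the product favors the class that both models consider plausible. A class that either model scores near zero is effectively vetoed, which is one known property of product-based fusion.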
