Abstract: Deep learning-based multi-omics data integration methods have the capability to reveal the mechanisms of cancer development, discover cancer biomarkers and identify pathogenic targets. However, current methods ignore the potential correlations between samples in integrating multi-omics data. In addition, providing accurate biological explanations still poses significant challenges due to the complexity of deep learning models. Therefore, there is an urgent need for a deep learning-based multi-omics integration method to explore the potential correlations between samples and provide model interpretability. Herein, we propose a novel interpretable multi-omics data integration method (DeepKEGG) for cancer recurrence prediction and biomarker discovery. In DeepKEGG, a biological hierarchical module is designed for local connections of neuron nodes and model interpretability based on the biological relationship between genes/miRNAs and pathways. In addition, a pathway self-attention module is constructed to explore the correlation between different samples and generate the potential pathway feature representation for enhancing the prediction performance of the model. Lastly, an attribution-based feature importance calculation method is utilized to discover biomarkers related to cancer recurrence and provide a biological interpretation of the model. Experimental results demonstrate that DeepKEGG outperforms other state-of-the-art methods in 5-fold cross validation. Furthermore, case studies also indicate that DeepKEGG serves as an effective tool for biomarker discovery. The code is available at https://github.com/lanbiolab/DeepKEGG.

What problem does this paper attempt to address?

The paper addresses cancer recurrence prediction and biomarker discovery through multi-omics data integration. Existing multi-omics integration methods can predict well, but they offer limited biological explanation and often ignore potential correlations between samples. The paper therefore proposes a new deep learning framework, DeepKEGG, with two aims:

1. **Explore the potential correlations between samples**: A pathway self-attention module learns correlated features across different samples, which improves the prediction performance of the model.
2. **Provide model interpretability**: An attribution-based feature importance calculation method discovers biomarkers related to cancer recurrence and gives the model a biological explanation.

### Main contributions:

1. **Biological hierarchical module**: Based on prior biological knowledge, the authors construct a biological hierarchical module with locally connected neuron nodes to learn pathway feature representations. This alleviates the overfitting that deep neural networks suffer on small-sample, high-dimensional omics data, and it supports post-hoc interpretation of the model.
2. **Pathway self-attention module**: Three pathway self-attention modules learn the potentially correlated features of different samples in the pathway feature space, which improves the prediction performance of the model.
3. **Pathway contribution allocation method**: Back-propagation is first used to calculate the reference gradient of each input feature (gene/miRNA) with respect to the prediction result, which serves as the contribution score of that feature node. The contribution score is then re-allocated to the connected pathways according to the out-degree of the feature node, to evaluate how much the feature node contributes to each pathway.
4. **Experimental verification**: Five 5-fold cross-validation experiments are conducted on four TCGA cancer datasets and two TARGET cancer datasets. The results show that DeepKEGG outperforms other advanced classification methods in prediction performance. Case studies also indicate that DeepKEGG is effective for biomarker discovery and model interpretation.

### Method overview:

- **Data pre-processing**: Cancer datasets are obtained from the TCGA and TARGET databases, including SNV, mRNA, and miRNA data. Pre-processing includes gene annotation, normalization, and feature selection, among other steps.
- **Biological hierarchical module**: Three relationship matrices are constructed (mRNA-pathway, SNV-pathway, and miRNA-pathway). Local connections and model interpretation are based on the biological relationships between genes/miRNAs and pathways.
- **Pathway self-attention module**: A self-attention mechanism learns the pathway feature correlations between different samples.
- **Classification module**: A multi-layer perceptron (MLP) predicts cancer recurrence.
- **Model interpretation module**: The DeepLIFT method calculates the importance scores of features (genes, miRNAs) and evaluates their contributions to the prediction results.

### Experimental results:

- **Performance comparison**: On multiple cancer datasets, DeepKEGG outperforms other advanced methods on metrics such as AUC and AUPR.
- **Case study**: The case studies demonstrate the advantages of DeepKEGG in biomarker discovery and model interpretation.

In conclusion, the DeepKEGG framework both improves the accuracy of cancer recurrence prediction and provides biological explanations, which helps in understanding the molecular mechanisms of cancer development.
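
The local connection and the contribution re-allocation described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the `tanh` activation, the toy membership matrix, and the even split of a feature's score across its pathways are all assumptions made for illustration. The summary only states that scores are re-allocated "according to the out-degree" of the feature node.

```python
import numpy as np

def masked_pathway_layer(x, weights, mask, bias):
    """Locally connected layer: a gene only feeds the pathways it belongs to.

    x: (n_samples, n_genes); weights and mask: (n_genes, n_pathways).
    The tanh activation is an assumption, not taken from the paper.
    """
    return np.tanh(x @ (weights * mask) + bias)

def allocate_to_pathways(gene_scores, mask):
    """Split each gene's score evenly over its pathways, then sum per pathway.

    Assumed reading of "re-allocate according to out-degree":
    each connected pathway receives score / out_degree from the gene.
    """
    out_degree = mask.sum(axis=1)                       # pathways per gene
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    share = gene_scores / safe_degree                   # per-edge share
    return (mask * share[:, None]).sum(axis=0)          # (n_pathways,)

# Hypothetical gene-pathway membership: 4 genes x 2 pathways.
mask = np.array([[1, 0],
                 [1, 1],
                 [0, 1],
                 [0, 0]], dtype=float)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # 3 samples, 4 genes
weights = rng.normal(size=(4, 2))
bias = np.zeros(2)

pathway_features = masked_pathway_layer(x, weights, mask, bias)

# Hypothetical per-gene contribution scores (e.g. from an attribution method).
gene_scores = np.array([0.4, 0.6, 0.2, 0.9])
pathway_scores = allocate_to_pathways(gene_scores, mask)
```

In this toy setup the fourth gene belongs to no pathway, so it neither influences `pathway_features` nor contributes to `pathway_scores`. The second gene belongs to both pathways, so its score is halved between them.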

DeepKEGG: a multi-omics data integration framework with biological insights for cancer recurrence prediction and biomarker discovery

Interpretable meta-learning of multi-omics data for survival analysis and pathway enrichment

DeepOmix: A scalable and interpretable multi-omics deep learning framework and application in cancer survival analysis

TMODINET: A trustworthy multi-omics dynamic learning integration network for cancer diagnostic

Transformer-based deep learning integrates multi-omic data with cancer pathways

Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach

Deep Biological Pathway Informed Pathology-Genomic Multimodal Survival Prediction

GD‐Net: An Integrated Multimodal Information Model Based on Deep Learning for Cancer Outcome Prediction and Informative Feature Selection

Deep learning assisted multi-omics integration for survival and drug-response prediction in breast cancer

GREMI: An Explainable Multi-Omics Integration Framework for Enhanced Disease Prediction and Module Identification

Deep Learning-Based Multi-Omics Integration Robustly Predicts Relapse in Prostate Cancer

DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data

Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis

Integration of Multi-Omics Data for Gene Regulatory Network Inference and Application to Breast Cancer

Pancancer survival prediction using a deep learning architecture with multimodal representation and integration

Integration of multi-omics data to mine cancer-related gene modules

DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis

A denoised multi-omics integration framework for cancer subtype classification and survival prediction

InDEP: an interpretable machine learning approach to predict cancer driver genes from multi-omics data

Identification of functional gene modules by integrating multi-omics data and known molecular interactions

Biology-guided deep learning predicts prognosis and cancer immunotherapy response