DeepKEGG: a multi-omics data integration framework with biological insights for cancer recurrence prediction and biomarker discovery

Wei Lan, Haibo Liao, Qingfeng Chen, Lingzhi Zhu, Yi Pan, Yi-Ping Phoebe Chen
DOI: https://doi.org/10.1093/bib/bbae185
IF: 9.5
2024-04-28
Briefings in Bioinformatics
Abstract: Deep learning-based multi-omics data integration methods have the capability to reveal the mechanisms of cancer development, discover cancer biomarkers and identify pathogenic targets. However, current methods ignore the potential correlations between samples in integrating multi-omics data. In addition, providing accurate biological explanations still poses significant challenges due to the complexity of deep learning models. Therefore, there is an urgent need for a deep learning-based multi-omics integration method to explore the potential correlations between samples and provide model interpretability. Herein, we propose a novel interpretable multi-omics data integration method (DeepKEGG) for cancer recurrence prediction and biomarker discovery. In DeepKEGG, a biological hierarchical module is designed for local connections of neuron nodes and model interpretability based on the biological relationship between genes/miRNAs and pathways. In addition, a pathway self-attention module is constructed to explore the correlation between different samples and generate the potential pathway feature representation for enhancing the prediction performance of the model. Lastly, an attribution-based feature importance calculation method is utilized to discover biomarkers related to cancer recurrence and provide a biological interpretation of the model. Experimental results demonstrate that DeepKEGG outperforms other state-of-the-art methods in 5-fold cross validation. Furthermore, case studies also indicate that DeepKEGG serves as an effective tool for biomarker discovery. The code is available at https://github.com/lanbiolab/DeepKEGG.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The main problem this paper addresses is predicting cancer recurrence and discovering biomarkers by integrating multi-omics data. Existing multi-omics integration methods may predict well, but they offer limited biological explanation and often ignore potential correlations between samples. The paper therefore proposes a new deep learning framework, DeepKEGG, with two aims:

1. **Explore the potential correlations between samples**: A Pathway Self-Attention Module learns correlated features across different samples, which improves the prediction performance of the model.
2. **Provide model interpretability**: An attribution-based feature importance calculation method discovers biomarkers related to cancer recurrence and gives a biological explanation of the model.

### Main contributions:

1. **Biological hierarchical module**: Based on prior biological knowledge, the authors construct a biological hierarchical module with locally connected neuron nodes to learn pathway feature representations. This design helps alleviate overfitting of deep neural networks on small-sample, high-dimensional omics data and supports post-hoc interpretation of the model.
2. **Pathway Self-Attention Module**: Three Pathway Self-Attention Modules learn potentially correlated features of different samples in the pathway feature space, improving the prediction performance of the model.
3. **Pathway contribution allocation method**: Back-propagation is first used to calculate the reference gradient of each input feature (gene/miRNA) with respect to the prediction result, which serves as the contribution score of that feature node. The contribution score is then re-allocated to the connected pathways according to the out-degree of the feature node, giving the contribution of the feature node to each pathway.
4. **Experimental verification**: Five 5-fold cross-validation experiments are conducted on four TCGA cancer datasets and two TARGET cancer datasets. The results show that DeepKEGG outperforms other advanced classification methods in prediction performance. Case studies also indicate that DeepKEGG is effective for biomarker discovery and model interpretation.

### Method overview:

- **Data pre-processing**: Cancer datasets are obtained from the TCGA and TARGET databases, including SNV, mRNA, and miRNA data. Pre-processing includes gene annotation, normalization, and feature selection.
- **Biological hierarchical module**: Three relationship matrices (mRNA-pathway, SNV-pathway, and miRNA-pathway) are constructed, and local connections and model interpretation are based on the biological gene/miRNA-pathway relationships.
- **Pathway Self-Attention Module**: A self-attention mechanism learns the correlations between the pathway features of different samples.
- **Classification module**: A multi-layer perceptron (MLP) predicts cancer recurrence.
- **Model interpretation module**: The DeepLIFT method calculates importance scores of the features (genes, miRNAs) and evaluates their contributions to the prediction results.

### Experimental results:

- **Performance comparison**: On multiple cancer datasets, DeepKEGG outperforms other advanced methods on metrics such as AUC and AUPR.
- **Case study**: The case studies demonstrate the advantages of DeepKEGG in biomarker discovery and model interpretation.

In conclusion, the DeepKEGG framework improves the accuracy of cancer recurrence prediction and also provides biological explanations, which helps in understanding the molecular mechanisms of cancer development.
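The local connection idea behind the biological hierarchical module can be illustrated with a small masked layer. The sketch below is not taken from the DeepKEGG repository: the function name `masked_dense`, the toy relationship matrix, the weights, and the `tanh` activation are all hypothetical choices for illustration. It only shows how a binary gene-pathway relationship matrix can zero out every weight that has no biological counterpart, so each pathway node sees only its member genes.

```python
import math

def masked_dense(x, weights, mask):
    """Compute pathway activations for one sample.

    x       : list of gene-level inputs (length n_genes)
    weights : n_genes x n_pathways weight matrix (hypothetical values)
    mask    : n_genes x n_pathways binary relationship matrix;
              mask[i][j] == 1 means gene i belongs to pathway j
    The tanh activation is an assumption made for this sketch.
    """
    n_pathways = len(mask[0])
    out = []
    for j in range(n_pathways):
        # Weights outside the relationship matrix are multiplied by 0,
        # so only member genes contribute to pathway j.
        total = sum(x[i] * weights[i][j] * mask[i][j] for i in range(len(x)))
        out.append(math.tanh(total))
    return out

# Toy example: 3 genes, 2 pathways (values are made up).
mask = [[1, 0],
        [1, 1],
        [0, 1]]
weights = [[0.5, 9.9],   # 9.9 entries are masked out and never used
           [-0.2, 0.3],
           [9.9, 0.1]]
x = [1.0, 2.0, 3.0]

pathway_features = masked_dense(x, weights, mask)
```

Because masked weights never influence the output, the number of effective parameters equals the number of known gene-pathway links, which is one way to read the summary's claim that local connections reduce overfitting on small-sample, high-dimensional data.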
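The pathway contribution allocation step can be sketched in the same spirit. The summary states only that feature contribution scores are re-allocated to connected pathways according to the out-degree of each feature node. The equal split used below (score divided by out-degree) is an assumption, as are the function name `pathway_contributions` and the toy scores. In the paper the scores come from an attribution method such as DeepLIFT, whereas here they are simply given as input.

```python
def pathway_contributions(feature_scores, mask):
    """Re-allocate feature contribution scores to pathways.

    feature_scores : one contribution score per gene/miRNA
    mask           : n_features x n_pathways binary relationship matrix
    Assumption: a feature splits its score equally across the pathways
    it connects to, i.e. each pathway receives score / out_degree.
    """
    n_pathways = len(mask[0])
    totals = [0.0] * n_pathways
    for score, row in zip(feature_scores, mask):
        out_degree = sum(row)  # number of pathways this feature belongs to
        if out_degree == 0:
            continue  # feature with no pathway link contributes nothing
        share = score / out_degree
        for j, connected in enumerate(row):
            if connected:
                totals[j] += share
    return totals

# Toy example: 3 features, 2 pathways (values are made up).
mask = [[1, 0],
        [1, 1],
        [0, 1]]
scores = [0.6, 0.4, 0.2]

result = pathway_contributions(scores, mask)
```

Under this equal-split assumption the total contribution is conserved: the pathway scores sum to the same value as the feature scores, so a gene that belongs to many pathways does not inflate the overall importance.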