Evaluation of network-guided random forest for disease gene discovery

Jianchang Hu, Silke Szymczak
DOI: https://doi.org/10.1186/s13040-024-00361-5
2024-04-19
BioData Mining
Abstract: Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF where the network information is summarized into a sampling probability of predictor variables which is further used in the construction of the RF.
mathematical & computational biology
### What problem does this paper attempt to address?

This paper evaluates whether guiding the random forest (RF) algorithm with gene-network information improves disease-gene discovery. Specifically, the authors assess whether integrating gene-network information into RF improves the identification of disease-related gene modules and pathways.

#### Background problems

1. **Limitations of traditional methods**:
   - In classical gene-expression analysis, genes are usually analyzed individually, ignoring the functional interdependence between genes.
   - Complex diseases (such as cancer) are rarely caused by an abnormality in a single gene; they arise from multiple genes and their interactions.
2. **Importance of gene networks**:
   - Gene-network information (such as protein-protein interaction networks and gene-regulation networks) is considered beneficial for identifying disease modules and pathways.
   - However, this information is not explicitly utilized in the standard RF algorithm.

#### Research objectives

1. **Improve disease prediction and gene discovery**:
   - By adjusting the variable-sampling probability in RF according to the network, the authors explore whether this method improves the accuracy of disease prediction.
   - More importantly, the authors focus on how well the method identifies disease-related gene modules and pathways.
2. **Evaluate the impact of network information**:
   - Through simulation studies and validation on real data, the authors evaluate how network information affects disease-gene identification and prediction performance in the network-guided RF.
   - In particular, they ask whether disease genes that form modules can be identified more accurately with the help of network information.
3. **Prevent false selection**:
   - The authors also examine whether network information causes genes to be selected falsely in the network-guided RF, especially hub genes.

### Method overview

To achieve the above objectives, the authors study the network-guided random forest (network-guided RF) and evaluate it through the following steps:

1. **Construct the network-guided random forest**:
   - Use gene-network information to modify the sampling probability of variables in RF, giving priority to genes with higher topological importance in the network (such as hub genes).
   - This modification can be achieved with the Directed Random Walk (DRW) algorithm, which calculates the sampling probability of each gene based on the network structure.
2. **Simulation study**:
   - Generate synthetic gene-expression data under different scenarios (such as no disease genes, randomly distributed disease genes, and modular disease genes).
   - Evaluate the performance of different methods (including standard RF, RF guided by marginal association information, and RF guided only by network topology) in disease prediction and gene identification.
3. **Real-data analysis**:
   - Use two independent breast-cancer datasets (from TCGA, based on microarray and RNA-sequencing technologies respectively) to evaluate the performance of the network-guided RF in predicting progesterone-receptor (PR) status.

### Conclusion

The results show that the network-guided RF identifies disease genes more accurately in some cases (such as when disease genes form modules), but may be inferior to the standard RF in other cases (such as when disease genes are randomly distributed). In addition, the authors emphasize the need for caution when using network information, to avoid false gene selection.
In general, this paper attempts to improve the performance of the random forest algorithm in disease-gene discovery by introducing gene-network information, and it comprehensively evaluates the effectiveness and potential problems of doing so.
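
The core idea in the method overview, turning network structure into a per-gene sampling probability and then using it to draw the candidate variables for a split, can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: it uses a generic random walk with restart rather than the exact DRW formulation, the six-gene toy network is invented, and the function names (`random_walk_sampling_probs`, `sample_candidate_genes`) and the parameters `restart` and `mtry` are hypothetical.

```python
import numpy as np


def random_walk_sampling_probs(adjacency, initial_weights, restart=0.3,
                               tol=1e-10, max_iter=1000):
    """Random walk with restart on a gene network (illustrative, not the paper's DRW).

    Returns a vector of per-gene sampling probabilities that sums to 1.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalise so each gene spreads its mass evenly over its neighbours.
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # isolated genes keep an all-zero column
    transition = adjacency / col_sums

    p0 = np.asarray(initial_weights, dtype=float)
    p0 = p0 / p0.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p / p.sum()


def sample_candidate_genes(probs, mtry, rng):
    """Draw mtry distinct candidate genes for one split, weighted by probs."""
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


# Toy network (assumed): gene 0 is a hub linked to genes 1-4; gene 5 hangs off gene 4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]
n_genes = 6
adjacency = np.zeros((n_genes, n_genes))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

# Uniform starting weights, i.e. topology only, no marginal association information.
probs = random_walk_sampling_probs(adjacency, np.ones(n_genes))

rng = np.random.default_rng(0)
candidates = sample_candidate_genes(probs, mtry=3, rng=rng)

print("sampling probabilities:", np.round(probs, 3))
print("candidate genes for one split:", candidates)
```

In this toy setting the hub gene receives the largest sampling probability, which illustrates both the intended prioritization and the risk of falsely selecting hub genes noted in the conclusion. Replacing the uniform starting weights with per-gene association scores would correspond to combining topology with marginal association information.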