Gene set proximity analysis: expanding gene set enrichment analysis through learned geometric embeddings

Henry Cousins, Taryn Hall, Yinglong Guo, Luke Tso, Kathy Tzy-Hwa Tzeng, Le Cong, Russ Altman
DOI: https://doi.org/10.1093/bioinformatics/btac735
2022-02-01
Abstract: Gene set analysis methods rely on knowledge-based representations of genetic interactions in the form of both gene set collections and protein-protein interaction (PPI) networks. Explicit representations of genetic interactions often fail to capture complex interdependencies among genes, limiting the analytic power of such methods. Here we propose an extension of gene set enrichment analysis to a latent feature space reflecting PPI network topology, called gene set proximity analysis (GSPA). Compared with existing methods, GSPA provides improved ability to identify disease-associated pathways in disease-matched gene expression datasets, while improving reproducibility of enrichment statistics for similar gene sets. GSPA is statistically straightforward, reducing to classical gene set enrichment through a single user-defined parameter. We apply our method to identify novel drug associations with SARS-CoV-2 viral entry. Finally, we validate our drug association predictions through retrospective clinical analysis of claims data from 8 million patients, supporting a role for gabapentin as a risk factor and metformin as a protective factor for COVID-19 hospitalization.
Quantitative Methods, Genomics
What problem does this paper attempt to address?
The paper addresses a limitation of existing gene set analysis methods for gene expression data: they do not fully capture complex interdependencies among genes, especially when the data are noisy or the gene sets are incomplete. Classical gene set enrichment analysis (GSEA) and network-based methods can identify pathway enrichment among differentially expressed genes, but they usually rely on explicit representations of gene interactions, such as gene set collections and protein-protein interaction (PPI) networks. This limits their ability to identify disease-related pathways, because they cannot effectively detect pathway perturbations that are implied by expression changes in neighboring genes.

To address this, the paper proposes an extension of gene set enrichment analysis called gene set proximity analysis (GSPA). GSPA maps genes into a latent feature space that reflects PPI network topology through learned geometric embeddings, so that the full network context of each gene is taken into account when enrichment is computed. According to the paper, this improves the ability to identify disease-related pathways and makes enrichment statistics more reproducible across semantically similar gene sets. The paper evaluates GSPA in four ways:

1. **Performance evaluation**: On the standard GEO2KEGG benchmark, GSPA is compared with GSEA and NGSEA (a network-enhanced gene set analysis method) on the task of identifying known disease-related gene sets. GSPA is reported to significantly outperform both methods on this task.
2. **Result consistency**: GSPA is evaluated on whether it produces consistent results for different but semantically similar gene sets. Its enrichment statistics for such gene sets are more strongly correlated than those of the compared methods.
3. **Drug repurposing prediction**: GSPA is used to predict drugs that may modulate SARS-CoV-2 entry into host cells. From three CRISPR knockout screening datasets, it identifies several candidate drugs, including gabapentin, metformin, lorazepam, and clonazepam.
4. **Clinical validation**: A retrospective analysis of health insurance claims data from 8 million patients tests the association between the predicted drugs and the risk of COVID-19 hospitalization. The results support gabapentin as a risk factor and metformin as a protective factor for COVID-19 hospitalization.

In summary, the paper introduces GSPA to overcome the limitations of existing gene set analysis methods and to improve the accuracy of identifying disease-related pathways and predicting drug associations.
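The core idea, scoring enrichment over genes that lie close to a gene set in an embedding space, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The text here does not say how proximity is measured or how the single user-defined parameter enters the statistic, so the sketch assumes a Euclidean `radius` around each set member, a standard weighted running-sum enrichment score, and hypothetical function names (`expand_gene_set`, `enrichment_score`, `proximity_enrichment`). Under these assumptions, setting `radius=0` leaves the gene set unchanged (provided embeddings are distinct), which mirrors the stated reduction to classical enrichment.

```python
import numpy as np


def expand_gene_set(embeddings, members, radius):
    """Boolean mask of genes within `radius` of any set member.

    Assumption: proximity is Euclidean distance in the embedding space.
    embeddings: (n_genes, d) array; members: integer indices of set genes.
    """
    diffs = embeddings[:, None, :] - embeddings[members][None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)  # shape (n_genes, n_members)
    return dists.min(axis=1) <= radius


def enrichment_score(scores, in_set, p=1.0):
    """Weighted running-sum enrichment score in the style of classical GSEA."""
    if not in_set.any() or in_set.all():
        raise ValueError("gene set must be non-empty and must not cover all genes")
    order = np.argsort(-scores)  # rank genes by descending score
    hits = in_set[order]
    weights = np.abs(scores[order]) ** p
    hit_steps = np.where(hits, weights, 0.0)
    total = hit_steps.sum()
    # Fall back to uniform hit weights if every hit has a zero score.
    hit_steps = hit_steps / total if total > 0 else hits / hits.sum()
    miss_steps = np.where(hits, 0.0, 1.0 / (~hits).sum())
    running = np.cumsum(hit_steps - miss_steps)
    return running[np.argmax(np.abs(running))]  # maximum deviation from zero


def proximity_enrichment(scores, embeddings, members, radius, p=1.0):
    """Enrichment score of the radius-expanded gene set (hypothetical helper)."""
    mask = expand_gene_set(embeddings, members, radius)
    return enrichment_score(scores, mask, p=p)


# Toy data: six genes with 2-D embeddings and differential expression scores.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0],
                [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
scores = np.array([3.0, 2.5, 0.5, -0.2, -1.0, -2.0])
members = np.array([0])

classical = proximity_enrichment(scores, emb, members, radius=0.0)
expanded = proximity_enrichment(scores, emb, members, radius=0.2)
```

With a positive radius, gene 1 is pulled into the set because it sits next to gene 0 in the embedding, so its expression change contributes to the score even though it is not an annotated member. In practice the embeddings would come from a model trained on the PPI network, and significance would be assessed by permutation, neither of which is shown here.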