Small-cohort GWAS discovery with AI over massive functional genomics knowledge graph

Kexin Huang, Tony Zeng, Soner Koc, Alexandra Pettet, Jingtian Zhou, Mika Jain, Dongbo Sun, Camilo Ruiz, Hongyu Ren, Laurence J Howe, Tom Richardson, Adrian Cortes, Katie Aiello, Kim Branson, Andreas R Pfenning, Jesse Engreitz, Martin Jinye Zhang, Jure Leskovec
DOI: https://doi.org/10.1101/2024.12.03.24318375
2024-12-05
Abstract: Genome-wide association studies (GWASs) have identified tens of thousands of disease-associated variants and provided critical insights into developing effective treatments. However, limited sample sizes have hindered the discovery of variants for uncommon and rare diseases. Here, we introduce KGWAS, a novel geometric deep learning method that leverages a massive functional knowledge graph across variants and genes to improve detection power in small-cohort GWASs significantly. KGWAS assesses the strength of a variant association with disease based on the aggregate GWAS evidence across molecular elements interacting with the variant within the knowledge graph. Comprehensive simulations and replication experiments showed that, for small sample sizes (N=1-10K), KGWAS identified up to 100% more statistically significant associations than state-of-the-art GWAS methods and achieved the same statistical power with up to 2.67X fewer samples. We applied KGWAS to 554 uncommon UK Biobank diseases (N_case<5K) and identified 183 more associations (46.9% improvement) than the original GWAS, where the gain further increases to 79.8% for 141 rare diseases (N_case<300). The KGWAS-only discoveries are supported by abundant functional evidence, such as rs2155219 (on 11q13) associated with ulcerative colitis potentially via regulating LRRC32 expression in CD4+ regulatory T cells, and rs7312765 (on 12q12) associated with the rare disease myasthenia gravis potentially via regulating PPHLN1 expression in neuron-related cell types. Furthermore, KGWAS consistently improves downstream analyses such as identifying disease-specific network links for interpreting GWAS variants, identifying disease-associated genes, and identifying disease-relevant cell populations. Overall, KGWAS is a flexible and powerful AI model that integrates growing functional genomics data to discover novel variants, genes, cells, and networks, especially valuable for small cohort diseases.
Genetic and Genomic Medicine
### What problem does this paper attempt to solve?

This paper addresses the limited power of genome-wide association studies (GWAS) to discover disease-associated genetic variants in small-sample cohorts. GWAS has identified tens of thousands of disease-associated variants and provided important clues for developing effective treatments, but limited sample sizes hinder the discovery of variants for uncommon and rare diseases. To overcome this, the authors introduce **KGWAS**, a geometric deep learning method that uses a large-scale functional genomics knowledge graph to improve GWAS detection power in small cohorts. By integrating functional genomic data, KGWAS identifies disease-associated variants with fewer samples.

### Main problems and solutions

1. **Problem**: Traditional GWAS needs a large number of samples to reach enough statistical power to detect associated variants. This is a major obstacle for uncommon and rare diseases, whose cohorts are usually small.
2. **Solution**:
   - **KGWAS** encodes genetic variants, genes, and their interactions in a comprehensive functional genomics knowledge graph.
   - For a given disease, KGWAS trains a disease-specific graph neural network (GNN) to predict the association strength between each variant and the disease.
   - KGWAS combines functional genomic data with GWAS summary statistics and uses the predicted association strength as prior information to refine the GWAS p-values, which improves detection power.
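The idea of using graph-derived evidence as a prior on GWAS p-values can be illustrated with a toy sketch. This is not the KGWAS implementation: the trained GNN is replaced by a single round of mean aggregation over graph neighbors, and the prior is applied through generic p-value weighting (divide each p-value by a weight, with weights averaging to 1). The variants, edges, p-values, and function names are all hypothetical.

```python
import math

# Toy GWAS p-values for four hypothetical variants (illustrative numbers only).
p_values = {"v1": 1e-9, "v2": 8e-8, "v3": 0.4, "v4": 0.03}

# Hypothetical knowledge-graph links, e.g. two variants tied to the same gene.
edges = [("v1", "v2"), ("v3", "v4")]

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def neighbor_evidence(p_values, edges):
    """Mean -log10(p) over each variant's graph neighbors (own p excluded)."""
    neighbors = {v: [] for v in p_values}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    signal = {v: -math.log10(p) for v, p in p_values.items()}
    return {
        v: sum(signal[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        for v, nbrs in neighbors.items()
    }


def to_weights(evidence):
    """Map non-negative evidence to positive weights that average to 1."""
    raw = {v: 1.0 + e for v, e in evidence.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {v: r / mean_raw for v, r in raw.items()}


def weighted_p(p_values, weights):
    """Weighted p-value p / w, capped at 1."""
    return {v: min(1.0, p_values[v] / weights[v]) for v in p_values}


evidence = neighbor_evidence(p_values, edges)
weights = to_weights(evidence)
adjusted = weighted_p(p_values, weights)

hits_before = sorted(v for v, p in p_values.items() if p < GENOME_WIDE_ALPHA)
hits_after = sorted(v for v, p in adjusted.items() if p < GENOME_WIDE_ALPHA)
print(hits_before)  # ['v1']
print(hits_after)   # ['v1', 'v2']
```

In this toy, `v2` narrowly misses the threshold on its own but crosses it once its strongly associated neighbor raises its weight, while `v3` and `v4`, whose neighbors carry little signal, are down-weighted. Because the weights average to 1, the boost for some variants is paid for by a penalty on others. Whether such a scheme stays calibrated on real data depends on how the weights relate to the p-values under the null, which is why the calibration checks reported for KGWAS matter.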
### Experimental results

- **Simulation experiments**: KGWAS is well calibrated in simulations under both null and causal models, and it discovers more independent true associations than existing GWAS methods.
- **Replication experiments**: Systematic replication experiments across multiple independent cohorts (such as UK Biobank and FinnGen) show that KGWAS performs particularly well at small sample sizes and reaches the same statistical power with fewer samples.
- **Practical applications**: Applied to 554 uncommon diseases (including 141 rare diseases), KGWAS discovered 183 additional associations, a 46.9% improvement over the original GWAS. For the 141 rare diseases, the gain rises to 79.8%.

### Summary

KGWAS is a flexible and powerful AI model. By integrating the growing body of functional genomic data, it discovers new genetic variants, genes, cells, and networks more effectively in small-sample cohorts, which is especially valuable for research on uncommon and rare diseases.