Rapid and accurate multi-phenotype imputation for millions of individuals

Lin-Lin Gu, Hong-Shan Wu, Tian-Yi Liu, Yong-Jie Zhang, Jing-Cheng He, Xiao-Lei Liu, Zhi-Yong Wang, Guo-Bo Chen, Dan Jiang, Ming Fang
DOI: https://doi.org/10.1101/2023.06.25.546422
2024-06-04
Abstract: Deep phenotyping can enhance the power of genetic analysis, including genome-wide association studies (GWAS), but missing phenotypes compromise the potential of such resources. Although many phenotypic imputation methods have been developed, accurate imputation for millions of individuals remains extremely challenging. In the present study, we developed a novel multi-phenotype imputation method based on mixed fast random forest (PIXANT) by leveraging efficient machine learning (ML)-based algorithms. We demonstrate that PIXANT runs faster and uses less computer memory than other state-of-the-art methods when applied to the UK Biobank (UKB) data, suggesting that PIXANT is scalable to cohorts with millions of individuals. Our simulations with hundreds of individuals showed that PIXANT accuracy was superior or comparable to that of the most advanced methods available. PIXANT was used to impute 425 phenotypes for the UKB data of 277,301 unrelated White British citizens. When GWAS was subsequently performed on the imputed phenotypes, 18.4% more GWAS loci were identified than before imputation (8,710 vs 7,355). The increased statistical power of GWAS identified novel positional candidate genes affecting heart rate, such as RNF220, SCN10A, and RGS6, suggesting that the use of imputed phenotype data from a large cohort may lead to the discovery of novel genes for complex traits.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the loss of statistical power in genetic analyses caused by missing phenotype data in large-scale genomic studies. The authors developed PIXANT, a multi-phenotype imputation method based on mixed fast random forest, to impute multiple phenotypes efficiently and accurately for cohorts of millions of individuals. Imputation allows the data resources of large cohorts to be used more completely and increases the statistical power of genome-wide association studies (GWAS), which in turn helps identify more candidate genes for complex traits. For example, when GWAS was performed on UK Biobank phenotypes imputed by PIXANT, 18.4% more GWAS loci were identified than before imputation, and novel positional candidate genes affecting heart rate, such as RNF220, SCN10A, and RGS6, were discovered. This suggests that imputed phenotype data from large cohorts can aid the discovery of new genes.
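
The 18.4% figure can be checked directly from the locus counts reported in the abstract (8,710 after imputation vs 7,355 before). The following is a minimal sketch of that arithmetic only; the function name `relative_increase` is a hypothetical helper introduced for illustration and is not part of PIXANT.

```python
def relative_increase(before: int, after: int) -> float:
    """Return the percentage increase from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


# Locus counts as reported in the abstract.
loci_before = 7355  # GWAS loci identified before imputation
loci_after = 8710   # GWAS loci identified after imputation

gain = relative_increase(loci_before, loci_after)
print(f"{loci_after - loci_before} additional loci, {gain:.1f}% increase")
# prints "1355 additional loci, 18.4% increase"
```

The increase is measured relative to the pre-imputation count, which is why the denominator is 7,355 rather than 8,710.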