Optimized phenotype definitions boost GWAS power

Michael Zietz, Kathleen LaRow Brown, Undina Gisladottir, Nicholas Tatonetti
DOI: https://doi.org/10.1101/2024.06.11.598562
2024-06-13
Abstract: Complex diseases are among the central challenges facing the world, and genetics underlie a large fraction of the risk. Observational data, such as electronic health records (EHR), offer numerous advantages in the study of complex disease genetics. These include their large scale, cost-effectiveness, information on many different conditions, and future scalability with the widespread adoption of EHRs. Observational data, however, are challenging for research as they reflect various factors including the healthcare process and access to care, as well as broader societal effects like systemic biases. Here, we introduce MaxGCP, a novel phenotyping method designed to purify the genetic signal in observational data. Our approach optimizes a phenotype definition to maximize its coheritability with the complex trait of interest. We validated the method in simulations and applied it to real data analyses of stroke and Alzheimer's disease. We found that MaxGCP improves genome-wide association study (GWAS) power compared to conventional, single-code phenotype definitions. MaxGCP is a powerful tool for genetic discovery in observational data, and we anticipate that it will be broadly useful for studying complex diseases using observational data.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses a central challenge in the genetics of complex diseases: how to improve the statistical power of genome-wide association studies (GWAS). Specifically, the authors propose a new method, MaxGCP, that optimizes phenotype definitions to strengthen the genetic signal and reduce the impact of environmental noise.

#### Background and Problem Description

1. **Genetic Basis of Complex Diseases**
   - Despite decades of theoretical and experimental progress and thousands of studies on the genetics of complex diseases, their genetic basis is still not fully understood.
   - Observational data, such as electronic health records (EHR), offer several advantages for studying complex disease genetics: large scale, cost-effectiveness, coverage of many conditions, and future scalability as EHR adoption spreads.
2. **Challenges of Observational Data**
   - Observational data reflect the healthcare process, access to care, and broader societal effects such as systemic biases, all of which complicate research.
   - These data are often incomplete, noisy, and biased, which reduces study power.
3. **Limitations of Existing Methods**
   - Existing methods mainly rely on single-code phenotype definitions, which may not make full use of the diverse phenotype information available and can yield weak genetic signals.
   - Some methods require features to be selected in advance, which limits their use in large-scale biobank datasets.

#### Core Problems of the MaxGCP Method

To overcome these challenges, the authors propose MaxGCP (Maximized Genetic Covariance Phenotyping), a statistical method that optimizes a phenotype definition, expressed as a linear combination of phenotypes, to maximize its coheritability with the target phenotype.
Specifically:

- **Optimizing Phenotype Definitions**: MaxGCP builds a phenotype definition as a linear combination of multiple related phenotypes, which strengthens the genetic signal and reduces environmental noise.
- **Improving GWAS Power**: With an optimized phenotype definition, GWAS can detect more associated genetic variants, improving the statistical power of the study.

#### Formula Representation

Suppose any phenotype \( A \) can be written as the sum of a genetic component \( g_A \) and an environmental component \( e_A \):

\[ A = g_A + e_A \]

The coheritability between two phenotypes \( A \) and \( B \) is defined as:

\[ h_c^2(A, B)=\frac{\text{Cov}(g_A, g_B)}{\sqrt{\text{Var}(A)\text{Var}(B)}} \]

Define a phenotype \( y \) as a linear combination of feature phenotypes \( x_1,\cdots, x_m \):

\[ y = x^{\top}\beta \]

The goal of MaxGCP is to find the coefficient vector \(\hat{\beta}\) that maximizes the coheritability between \( y \) and the target phenotype \( z \). The solution is given as:

\[ \hat{\beta}=P^{-1}v\sqrt{v^{\top}P^{-1}v} \]

where:

- \( v \) is the vector of genetic covariances between the feature phenotypes and the target phenotype, that is, \( v_i = \text{Cov}(g_{x_i}, g_z) \)
- \( P \) is the phenotypic covariance matrix of the feature phenotypes, that is, \( P_{i,j}=\text{Cov}(x_i, x_j) \)

With these definitions, \( \text{Cov}(g_y, g_z) = \beta^{\top}v \) and \( \text{Var}(y) = \beta^{\top}P\beta \), so the quantity being maximized is:

\[ h_c^2(y, z)=\frac{\beta^{\top}v}{\sqrt{\beta^{\top}P\beta\,\text{Var}(z)}} \]

This ratio does not change when \( \beta \) is multiplied by a positive constant, so the optimum is determined by the direction \( P^{-1}v \), and the scalar factor only sets the scale of \( y \).

In this way, MaxGCP can improve GWAS power, and for complex diseases in particular it can better capture the underlying genetic signal.
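The formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `maxgcp_weights` and `coheritability` and the toy values of \( P \), \( v \), and \( \text{Var}(z) \) are hypothetical, chosen only for illustration. In practice \( v \) and \( P \) would have to be estimated from data, which this sketch does not cover. The sketch rescales the weights so that \( y \) has unit variance. This is a convenience choice and may differ from the scaling in the formula above, but coheritability is unaffected by positive rescaling of \( \beta \).

```python
import numpy as np


def maxgcp_weights(P, v):
    """Weights proportional to P^{-1} v, rescaled so that Var(y) = 1.

    The unit-variance scaling is an assumption of this sketch.
    """
    direction = np.linalg.solve(P, v)           # P^{-1} v without forming the inverse
    scale = np.sqrt(direction @ P @ direction)  # equals sqrt(v^T P^{-1} v) for symmetric P
    return direction / scale


def coheritability(beta, P, v, var_z):
    """h_c^2(y, z) for y = x^T beta, using Cov(g_y, g_z) = beta^T v."""
    return (beta @ v) / np.sqrt((beta @ P @ beta) * var_z)


# Hypothetical toy inputs: three feature phenotypes and one target phenotype.
P = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])   # phenotypic covariance of the features
v = np.array([0.10, 0.05, 0.08])  # genetic covariance of each feature with z
var_z = 1.0                       # phenotypic variance of the target

beta_hat = maxgcp_weights(P, v)
h_opt = coheritability(beta_hat, P, v, var_z)

# Coheritability of each feature used on its own (beta = unit vector).
h_single = v / np.sqrt(np.diag(P) * var_z)

print("weights:", beta_hat)
print("optimized coheritability:", h_opt)
print("single-feature coheritabilities:", h_single)
```

Because a single feature is the special case where \( \beta \) is a unit vector, the optimized coheritability can never be lower than the best single-feature value when computed from the same \( P \) and \( v \).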