Overcome the Limitation of Phenome-Wide Association Studies (PheWAS): Extension of PheWAS to Efficient and Robust Large-Scale ICD Codes Analysis

Ya-Chen Lin, Siwei Zhang, Tess Vessels, Lisa Bastarache, Cosmin Adrian Bejan, Ryan S Hsie, Elizabeth J Philips, Doug M Ruderfer, Jill M. Pulley, Todd L Edwards, Quinn S Wells, Jeremy L Warner, Joshua C Denny, Dan M Roden, Hakmook Kang, Yaomin Xu
DOI: https://doi.org/10.1101/2024.04.15.24305098
2024-04-19
Abstract: Phenome-wide association studies (PheWAS) have become widely used for efficient, high-throughput evaluation of the relationship between a genetic factor and a large number of disease phenotypes, typically extracted from a DNA biobank linked with electronic medical records (EMR). Phecodes, which encode billing code-derived disease case-control status, are usually used as outcome variables in PheWAS, and logistic regression has been the standard analysis method. Because clinical diagnoses in EMR are often inaccurate, and such errors can bias odds ratio estimates, much effort has gone into defining cases and controls accurately. Specifically, to classify controls correctly, a list of exclusion criteria was manually compiled for each Phecode so that the odds ratios would be unbiased. However, the accuracy of these lists cannot be guaranteed without an extensive data curation process. This costly curation limits the efficiency of large-scale analyses that take full advantage of all structured phenotypic information available in EMR. Here, we proposed to estimate relative risks (RR) instead. We first demonstrated, via a theoretical formula, the desirable property that RR is robust to inaccuracy in the controls. With simulation and a real data application, we further confirmed that RR is unbiased without compiling exclusion criteria lists. With RR as the estimate, we are able to efficiently extend PheWAS to a larger-scale, phenome construction-agnostic analysis of phenotypes using ICD 9/10 codes, which preserve much more disease-related clinical information than Phecodes.
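The abstract does not give the theoretical formula itself. As a minimal sketch of the kind of property it describes, assume (this is an illustrative assumption, not the paper's stated model) that control contamination arises because each true case is recorded as a case with the same probability in carriers and non-carriers, while non-cases are never recorded as cases. Under that assumption the recorded risk in each genotype group is scaled by the same factor, so the RR is unchanged while the odds ratio is not. The function names and the risk values below are hypothetical and chosen only for illustration.

```python
def risk_ratio(p1, p0):
    """Relative risk: disease risk in carriers divided by risk in non-carriers."""
    return p1 / p0


def odds_ratio(p1, p0):
    """Odds ratio computed from the same two risks."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


def observed_risks(p1, p0, sensitivity):
    """Recorded risks when each true case is captured with probability
    `sensitivity` in both genotype groups (assumed non-differential) and
    no non-case is recorded as a case (assumed perfect specificity).
    Uncaptured cases end up among the controls."""
    return sensitivity * p1, sensitivity * p0


# Hypothetical true risks in carriers (p1) and non-carriers (p0).
p1, p0 = 0.30, 0.15

for se in (1.0, 0.8, 0.5):
    q1, q0 = observed_risks(p1, p0, se)
    print(f"sensitivity={se:.1f}  RR={risk_ratio(q1, q0):.3f}  OR={odds_ratio(q1, q0):.3f}")
```

In this sketch the printed RR is the same at every sensitivity, whereas the OR moves toward the RR as more true cases are left among the controls. Other error patterns, such as false-positive diagnoses or capture rates that differ by genotype, are outside what this example covers.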
Health Informatics
What problem does this paper attempt to address?