Abstract: Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD). In response to the latter, many groups suggest performing LD pruning or excluding known high LD regions prior to PCA. However, these suggestions are not universally implemented and the implications for GWAS are not fully understood, especially in the context of admixed populations. In this paper, we investigate the impact of pre-processing and the number of PCs included in GWAS models in African American samples from the Women's Health Initiative SNP Health Association Resource and two Trans-Omics for Precision Medicine Whole Genome Sequencing Project contributing studies (Jackson Heart Study and Genetic Epidemiology of Chronic Obstructive Pulmonary Disease Study). In all three samples, we find the first PC is highly correlated with genome-wide ancestry whereas later PCs often capture local genomic features. The pattern of which, and how many, genetic variants are highly correlated with individual PCs differs from what has been observed in prior studies focused on European populations and leads to distinct downstream consequences: adjusting for such PCs yields biased effect size estimates and elevated rates of spurious associations due to the phenomenon of collider bias. Excluding high LD regions identified in previous studies does not resolve these issues. LD pruning proves more effective, but the optimal choice of thresholds varies across datasets. Altogether, our work highlights unique issues that arise when using PCA to control for ancestral heterogeneity in admixed populations and demonstrates the importance of careful pre-processing and diagnostics to ensure that PCs capturing multiple local genomic features are not included in GWAS models.

Principal component analysis (PCA) is a widely used technique in human genetics research. One of its most frequent applications is in the context of genetic association studies, wherein researchers use PCA to infer, and then adjust for, the genetic ancestry of study participants. Although PCA is a powerful approach, prior work has shown that it sometimes captures other features or data quality issues, and pre-processing steps have been suggested to address these concerns. However, the utility and downstream implications of this recommended pre-processing are not fully understood, nor are these steps universally implemented. Moreover, the vast majority of prior work in this area was conducted in studies that exclusively included individuals of European ancestry. Here, we revisit this work in the context of admixed populations, that is, populations with diverse, mixed ancestry that have been largely underrepresented in genetics research to date. We demonstrate the unique concerns that can arise in this context and illustrate the detrimental effects that including principal components in genetic association study models can have when not implemented carefully. Altogether, we hope our work serves as a reminder of the care that must be taken (including careful pre-processing, diagnostics, and modeling choices) when implementing PCA in admixed populations and beyond.
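The collider-bias mechanism described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the analysis performed in the paper: the sample size, allele frequency, effect size, and the stand-in "PC" (a noisy sum of two independent variants rather than a component estimated by PCA) are all hypothetical choices made for illustration. The trait depends only on one variant, yet adjusting for a covariate that loads on both variants makes the other, non-causal variant appear associated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # hypothetical sample size

# Two independent variants coded as allele counts (0/1/2);
# the allele frequency 0.3 is an arbitrary illustrative choice.
g_test = rng.binomial(2, 0.3, size=n).astype(float)
g_causal = rng.binomial(2, 0.3, size=n).astype(float)

# The trait depends only on g_causal; g_test has no true effect.
y = 0.5 * g_causal + rng.normal(size=n)

# Stand-in for a PC that captures a local genomic feature:
# it loads on both variants, so it is a collider between them.
# (Assumption: a real PC would be estimated from many variants.)
pc = g_test + g_causal + rng.normal(scale=0.5, size=n)


def ols_slope(outcome, covariates):
    """Return the OLS coefficient of the first covariate (after the intercept)."""
    design = np.column_stack([np.ones(len(outcome))] + covariates)
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[1]


# Without the PC, the estimated effect of g_test is close to zero.
beta_unadjusted = ols_slope(y, [g_test])

# Conditioning on the collider induces a spurious (negative) effect.
beta_adjusted = ols_slope(y, [g_test, pc])

print(f"unadjusted effect of g_test: {beta_unadjusted:.3f}")
print(f"PC-adjusted effect of g_test: {beta_adjusted:.3f}")
```

In this toy setup the adjusted estimate is pulled away from zero because, once the stand-in PC is held fixed, higher values of the tested variant imply lower values of the causal variant. A PC that reflects only genome-wide ancestry would not behave this way, which is consistent with the emphasis above on checking what each PC captures before including it in a model.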
