Madeline H. Kowalski, Huijun Qian, Ziyi Hou, Jonathan D. Rosen, Amanda L. Tapia, Yue Shan, Deepti Jain, Maria Argos, Donna K. Arnett, Christy Avery, Kathleen C. Barnes, Lewis C. Becker, Stephanie A. Bien, Joshua C. Bis, John Blangero, Eric Boerwinkle, Donald W. Bowden, Steve Buyske, Jianwen Cai, Michael H. Cho, Seung Hoan Choi, Helene Choquet, L. Adrienne Cupples, Mary Cushman, Michelle Daya, Paul S. de Vries, Patrick T. Ellinor, Nauder Faraday, Myriam Fornage, Stacey Gabriel, Santhi K. Ganesh, Misa Graff, Namrata Gupta, Jiang He, Susan R. Heckbert, Bertha Hidalgo, Chani J. Hodonsky, Marguerite R. Irvin, Andrew D. Johnson, Eric Jorgenson, Robert Kaplan, Sharon L. R. Kardia, Tanika N. Kelly, Charles Kooperberg, Jessica A. Lasky-Su, Ruth J. F. Loos, Steven A. Lubitz, Rasika A. Mathias, Caitlin P. McHugh, Courtney Montgomery, Jee-Young Moon, Alanna C. Morrison, Nicholette D. Palmer, Nathan Pankratz, George J. Papanicolaou, Juan M. Peralta, Patricia A. Peyser, Stephen S. Rich, Jerome Rotter, Edwin K. Silverman, Jennifer A. Smith, Nicholas L. Smith, Kent D. Taylor, Timothy A. Thornton, Hemant K. Tiwari, Russell P. Tracy, Tao Wang, Scott T. Weiss, Lu-Chen Weng, Kerri L. Wiggins, James G. Wilson, Lisa R. Yanek, Sebastian Zollner, Kari E. North, Paul L. Auer, Laura M. Raffield, Alexander P. Reiner, Yun Li

Abstract: Most genome-wide association and fine-mapping studies to date have been conducted in individuals of European descent, and genetic studies of populations of Hispanic/Latino and African ancestry are still limited. In addition to the limited inclusion of these populations in genetic studies, these populations have more complex linkage disequilibrium structure that may reduce the number of variants associated with a phenotype. To better define the genetic architecture of these understudied populations, we leveraged >100,000 phased sequences available from deep-coverage whole genome sequencing through the multi-ethnic NHLBI Trans-Omics for Precision Medicine (TOPMed) program to impute genotypes into admixed African and Hispanic/Latino samples with commercial genome-wide genotyping array data. We demonstrate that using TOPMed sequencing data as the imputation reference panel improves genotype imputation quality in these populations, which subsequently enhances gene-mapping power for complex traits. For rare variants with minor allele frequency (MAF) < 0.5%, we observed a 2.3- to 6.1-fold increase in the number of well-imputed variants, with 11-34% improvement in average imputation quality, compared to the state-of-the-art 1000 Genomes Project Phase 3 and Haplotype Reference Consortium reference panels, respectively. Impressively, even for extremely rare variants with sample minor allele count <10 (including singletons) in the imputation target samples, average information content rescued was >86%. Subsequent association analyses of TOPMed reference panel-imputed genotype data with hematological traits (hemoglobin (HGB), hematocrit (HCT), and white blood cell count (WBC)) in ~20,000 self-identified African descent individuals and ~23,000 self-identified Hispanic/Latino individuals identified associations with two rare variants in the HBB gene: rs33930165 with higher WBC (p = 8.1×10⁻¹²) in African populations, and rs11549407 with lower HGB (p = 1.59×10⁻¹²) and HCT (p = 1.13×10⁻⁹) in Hispanics/Latinos. By comparison, neither variant would have been genome-wide significant if either the 1000 Genomes Project Phase 3 or the Haplotype Reference Consortium reference panel had been used for imputation. Our findings highlight the utility of the TOPMed imputation reference panel for identification of novel associations between rare variants and complex traits not previously detected in similarly sized genome-wide studies of under-represented African and Hispanic/Latino populations.

Author summary: Admixed African and Hispanic/Latino populations remain understudied in genome-wide association and fine-mapping studies of complex diseases. These populations have more complex linkage disequilibrium (LD) structure that can impair mapping of variants associated with complex diseases and their risk factors. Genotype imputation represents an approach to improve genome coverage, especially for rare or ancestry-specific variation; however, these understudied populations also have smaller relevant imputation reference panels that need to be expanded to represent their more complex LD patterns. In this study, we leveraged >100,000 phased sequences generated from the multi-ethnic NHLBI TOPMed project to impute in admixed cohorts encompassing ~20,000 individuals of African ancestry (AAs) and ~23,000 Hispanics/Latinos. We demonstrated substantially higher imputation quality for low-frequency and rare variants in comparison to the state-of-the-art reference panels (1000 Genomes Project and Haplotype Reference Consortium). Association analyses of ~35 million (AAs) and ~27 million (Hispanics/Latinos) variants passing stringent post-imputation filtering with quantitative hematological traits led to the discovery of associations with two rare variants in the HBB gene; one of these variants was replicated in an independent sample, and the other is known to cause anemia in the homozygous state. By comparison, the same HBB variants would not have been genome-wide significant using other state-of-the-art reference panels due to lower imputation quality. Our findings demonstrate the power of the TOPMed whole genome sequencing data for imputation and subsequent association analysis in admixed African and Hispanic/Latino populations.
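
The panel comparison described above (number of well-imputed rare variants, average imputation quality, and the resulting fold increase) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The `Variant` record, the `summarize_rare` helper, the Rsq cutoff of 0.8, and the toy values are all hypothetical and chosen only for illustration. The 5×10⁻⁸ genome-wide significance threshold is the conventional one and is assumed here, since the text does not state which threshold was used.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Variant:
    """Hypothetical per-variant record: minor allele frequency and imputation quality."""
    maf: float
    rsq: float


def summarize_rare(variants: Iterable[Variant],
                   maf_max: float = 0.005,   # MAF < 0.5%, as in the abstract
                   rsq_min: float = 0.8      # assumed cutoff for "well-imputed"
                   ) -> Tuple[int, float]:
    """Return (number of well-imputed rare variants, mean Rsq over all rare variants)."""
    rare = [v for v in variants if v.maf < maf_max]
    well_imputed = [v for v in rare if v.rsq >= rsq_min]
    mean_rsq = sum(v.rsq for v in rare) / len(rare) if rare else float("nan")
    return len(well_imputed), mean_rsq


def is_genome_wide_significant(p_value: float, alpha: float = 5e-8) -> bool:
    """Conventional genome-wide threshold (assumed; not stated in the text)."""
    return p_value < alpha


# Toy values, invented purely to exercise the functions (not study data).
toy_topmed = [Variant(0.001, 0.90), Variant(0.002, 0.85),
              Variant(0.004, 0.60), Variant(0.200, 0.99)]
toy_other = [Variant(0.001, 0.50), Variant(0.002, 0.85),
             Variant(0.004, 0.30), Variant(0.200, 0.99)]

n_topmed, mean_topmed = summarize_rare(toy_topmed)
n_other, mean_other = summarize_rare(toy_other)
fold_increase = n_topmed / n_other  # ratio of well-imputed rare variant counts

# P-values reported in the abstract for the two HBB variants.
reported_p = {"rs33930165_WBC": 8.1e-12,
              "rs11549407_HGB": 1.59e-12,
              "rs11549407_HCT": 1.13e-9}
all_significant = all(is_genome_wide_significant(p) for p in reported_p.values())
```

In practice the per-variant MAF and quality values would be read from the imputation software's output rather than typed in by hand, and the quality cutoff would follow the study's own post-imputation filtering rules.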

Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium Whole Genome Sequences Improves Imputation Quality and Detection of Rare Variant Associations in Admixed African and Hispanic/Latino Populations
