GAUSS: a summary-statistics-based R package for accurate estimation of linkage disequilibrium for variants, Gaussian imputation, and TWAS analysis of cosmopolitan cohorts

Donghyung Lee, Silviu-Alin Bacanu

DOI: https://doi.org/10.1093/bioinformatics/btae203

IF: 5.8

2024-03-29

Bioinformatics

Abstract:

Motivation: As the availability of larger and more ethnically diverse reference panels grows, there is an increase in demand for ancestry-informed imputation of genome-wide association studies (GWAS), and other downstream analyses, e.g. fine-mapping. Performing such analyses at the genotype level is computationally challenging and necessitates, at best, a laborious process to access individual-level genotype and phenotype data. Summary-statistics-based tools, not requiring individual-level data, provide an efficient alternative that streamlines computational requirements and promotes open science by simplifying the re-analysis and downstream analysis of existing GWAS summary data. However, existing tools perform only disparate parts of needed analysis, have only command-line interfaces, and are difficult to extend/link by applied researchers.

Results: To address these challenges, we present Genome Analysis Using Summary Statistics (GAUSS), a comprehensive and user-friendly R package designed to facilitate the re-analysis/downstream analysis of GWAS summary statistics. GAUSS offers an integrated toolkit for a range of functionalities, including (i) estimating ancestry proportion of study cohorts, (ii) calculating ancestry-informed linkage disequilibrium, (iii) imputing summary statistics of unobserved variants, (iv) conducting transcriptome-wide association studies, and (v) correcting for “Winner’s Curse” biases. Notably, GAUSS utilizes an expansive, multi-ethnic reference panel consisting of 32 953 genomes from 29 ethnic groups. This panel enhances the range and accuracy of imputable variants, including the ability to impute summary statistics of rarer variants. As a result, GAUSS elevates the quality and applicability of existing GWAS analyses without requiring access to subject-level genotypic and phenotypic information.

Availability and implementation: The GAUSS R package, complete with its source code, is readily accessible to the public via our GitHub repository at https://github.com/statsleelab/gauss. To further assist users, we provided illustrative use-case scenarios that are conveniently found at https://statsleelab.github.io/gauss/, along with a comprehensive user guide detailed in Supplementary Text S1.

biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology

What problem does this paper attempt to address?

The paper addresses several practical obstacles in the re-analysis and downstream analysis of genome-wide association study (GWAS) summary statistics:

1. **Ancestry Proportion Estimation in Multi-Ethnic Populations**: Accurate ancestry proportions are a prerequisite for subsequent summary-statistics-based analyses of multi-ethnic cohorts. Many traditional methods require individual-level genotype data, which is often unavailable due to privacy concerns. GAUSS estimates the ancestry proportions of a study cohort using only allele frequencies (AF) or association Z-scores.
2. **Ancestry-Informed Linkage Disequilibrium (LD) Calculation**: As the ancestries represented in GWAS become more diverse, ancestry-informed LD estimates become more important. GAUSS provides the `computeLD()` function, which uses its 33KG reference panel (32 953 genomes from 29 ethnic groups) to calculate LD values specific to different ethnic groups.
3. **Imputation of Summary Statistics for Unobserved SNPs**: Traditional genotype imputation methods require individual-level genotype data and are computationally intensive. GAUSS offers the `dist()` and `distmix()` functions, which directly impute summary statistics (such as association Z-scores) for unobserved SNPs in homogeneous and multi-ethnic cohorts, respectively.
4. **Transcriptome-Wide Association Studies (TWAS)**: GAUSS integrates the TWAS tools `jepeg()` and `jepegmix()`, for homogeneous and heterogeneous cohorts respectively, to explore the functional links between genetic variation and complex traits.
5. **Correction for "Winner's Curse" Bias**: Winner's curse inflates the estimated effects of variants that were selected because they passed a significance threshold, so reported association signals are biased. GAUSS integrates the FIQT (False Discovery Rate Inverse Quantile Transformation) method to adjust association statistics for this bias.
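The direct imputation of Z-scores for unobserved SNPs described above can be illustrated with the standard conditional-mean formula for a multivariate normal: given observed Z-scores `z_obs`, their LD (correlation) matrix `ld_obs`, and the LD between the unobserved SNP and the observed ones `ld_cross`, the imputed value is `ld_cross · (ld_obs + ridge·I)⁻¹ · z_obs`. The sketch below is a minimal, pure-Python illustration of that idea, not the GAUSS implementation. The function names (`solve`, `impute_z`), the `ridge` regularization value, and the toy LD numbers are all assumptions made for the example; the actual `dist()` and `distmix()` functions may differ in their details.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copy rows so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def impute_z(z_obs, ld_obs, ld_cross, ridge=0.1):
    """Impute one unobserved Z-score as ld_cross . (ld_obs + ridge*I)^-1 . z_obs.

    `ridge` is a hypothetical regularization constant (an assumption of this
    sketch) that stabilizes the inverse when the LD matrix is estimated
    from a finite reference panel.
    """
    n = len(z_obs)
    regularized = [
        [ld_obs[i][j] + (ridge if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    weights = solve(regularized, z_obs)
    return sum(c * w for c, w in zip(ld_cross, weights))


# Toy example (made-up numbers): two observed SNPs, one unobserved SNP.
z_obs = [3.0, 1.5]
ld_obs = [[1.0, 0.5], [0.5, 1.0]]
ld_cross = [0.8, 0.4]
print(impute_z(z_obs, ld_obs, ld_cross, ridge=0.0))  # approximately 2.4
```

In a multi-ethnic setting, the LD inputs would be ancestry-informed rather than taken from a single reference population, which is where the ancestry proportions and ethnic-group-specific LD estimates come in.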
Through these features, GAUSS aims to simplify the re-analysis and downstream analysis of large GWAS summary statistics without the need for access to individual-level genotype and phenotype data, thereby promoting open science and enhancing the quality and applicability of existing GWAS analyses.
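The Winner's Curse correction can also be sketched briefly. As its name suggests, FIQT combines a false discovery rate (FDR) adjustment with an inverse quantile transformation: each Z-score is converted to a two-sided p-value, the p-values are FDR-adjusted, and the adjusted p-values are mapped back to Z-scores carrying the original sign. The following is a minimal sketch under those assumptions, using a Benjamini-Hochberg adjustment and only the Python standard library. The function names and the `min_p` floor are choices made for this example and do not describe the GAUSS API.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values, returned in input order."""
    n = len(pvals)
    # Visit p-values from largest to smallest, tracking the running minimum.
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def fiqt(z_scores, min_p=1e-300):
    """Shrink Z-scores: two-sided p-value -> FDR adjustment -> back to a signed Z.

    `min_p` is an assumed floor that keeps p-values strictly positive so the
    inverse normal CDF is defined.
    """
    pvals = [max(2.0 * _STD_NORMAL.cdf(-abs(z)), min_p) for z in z_scores]
    adjusted = bh_adjust(pvals)
    shrunk = []
    for z, p in zip(z_scores, adjusted):
        magnitude = -_STD_NORMAL.inv_cdf(p / 2.0)  # upper-tail quantile, >= 0
        shrunk.append(magnitude if z >= 0 else -magnitude)
    return shrunk


# Toy example (made-up Z-scores): one strong signal among weak ones.
print(fiqt([5.0, 0.1, -0.2, 0.3]))
```

Because FDR-adjusted p-values are never smaller than the raw ones, every adjusted Z-score is pulled toward zero (or left essentially unchanged), which is the direction of correction that Winner's Curse calls for.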

HAPRAP: a haplotype-based iterative method for statistical fine mapping using GWAS summary statistics

GWASBrewer: An R Package for Simulating Realistic GWAS Summary Statistics

A high-performance computing toolset for relatedness and principal component analysis of SNP data

Integrate multiple traits to detect novel trait–gene association using GWAS summary data with an adaptive test approach

GAPIT Version 2: an Enhanced Integrated Tool for Genomic Association and Prediction

MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics

GenoTools: An Open-Source Python Package for Efficient Genotype Data Quality Control and Analysis

Gfa2bin enables graph-based GWAS by converting genome graphs to pan-genomic genotypes

SNPTransformer: a lightweight toolkit for genome-wide association studies

GIGSEA: Genotype Imputed Gene Set Enrichment Analysis Using GWAS Summary Level Data

snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

Small-group originating model: Optimized individual-level GWAS simulation featured by SLiM and using open-access data

GWAShug: a comprehensive platform for decoding the shared genetic basis between complex traits based on summary statistics

CCAFE: Estimating Case and Control Allele Frequencies from GWAS Summary Statistics

EigenGWAS: An online visualizing and interactive application for detecting genomic signatures of natural selection

The goldmine of GWAS summary statistics: a systematic review of methods and tools

Biostatistical Aspects of Whole Genome Sequencing Studies: Preprocessing and Quality Control

Accurate cross-platform GWAS analysis via two-stage imputation

qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots