Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data

Drew DeHaas, Ziqing Pan, Xinzhu Wei
DOI: https://doi.org/10.1101/2024.04.23.590800
2024-08-21
Abstract: Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), making it cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly represent phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server's RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files which scales to a million whole genomes. It takes 160 GB of disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants such as allele frequency and association effect can be computed on GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.
Genetics
What problem does this paper attempt to address?
The main problems that this paper attempts to solve are the efficiency and cost of storing and processing large-scale genomic data. Specifically:

1. **Data storage problem**: Advances in sequencing technology and growing interest in the links between genetics and disease have produced a large amount of human whole-genome data. For example, the polymorphism data for 200,000 phased whole genomes in the UK Biobank exceeds 350 terabytes (TB) in Variant Call Format (VCF), which makes the data cumbersome and inefficient to store and process.
2. **Data processing efficiency problem**: Traditional formats for genetic polymorphism data (such as VCF and BGEN) are convenient to use, but they have clear shortcomings at scale. These formats organize data in tabular form, with rows and columns representing samples and genetic variants. This structure becomes unsustainable for large datasets: it occupies a large amount of storage space and is slow to process.

To address these problems, the authors introduce a new data structure, the Genotype Representation Graph (GRG). A GRG is a highly compact data structure that losslessly represents phased whole-genome polymorphism data. Its main features include:

- **Efficient compression**: By exploiting variant sharing among samples, a GRG compresses biobank-scale human data to the point where it fits in the RAM of a typical server (5-26 gigabytes (GB) per chromosome). Encoding 200,000 UK Biobank phased whole genomes as a GRG takes 160 GB of disk space, more than 13 times smaller than compressed VCF.
- **Fast processing**: Graph-traversal algorithms on a GRG can reuse already computed values, which significantly reduces computation time.
- **Scalability**: The accompanying command-line tool and C++/Python library scale to a million whole genomes. Summaries such as allele frequency and association effect are computed significantly faster on a GRG than with all tested alternatives (vcf.gz, PLINK BED, tree sequence, XSI, and Savvy).

In conclusion, this paper aims to solve the storage and processing efficiency problems of large-scale genomic data analysis by introducing the GRG, thereby improving the scalability of various computations and reducing their cost.
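
To make the idea of "reusing computed values" during graph traversal more concrete, here is a minimal Python sketch on a toy graph. It is not the authors' library or file format. The node numbering, the `CHILDREN` and `MUTATION_NODE` tables, and the function names are all hypothetical and chosen for illustration. The sketch assumes the multitree property mentioned in the abstract (at most one path between any two nodes), so the number of samples beneath a node is simply the sum over its children, and each node's count is computed once and then shared by every mutation above it.

```python
# Toy illustration only: a hand-made graph, not the GRG library or file format.
# Nodes 0..3 are sample (haplotype) leaves; nodes 4..6 are internal nodes.
# A mutation attached to a node is carried by every sample reachable from it.
CHILDREN = {
    0: [], 1: [], 2: [], 3: [],
    4: [0, 1],   # haplotypes 0 and 1 share whatever is attached here
    5: [4, 2],   # reuses node 4 instead of listing samples 0 and 1 again
    6: [4, 3],   # also reuses node 4
}
MUTATION_NODE = {"m1": 4, "m2": 5, "m3": 6, "m4": 3}  # hypothetical mutations
NUM_SAMPLES = 4


def samples_beneath(children):
    """Count the samples reachable from every node, visiting each node once.

    Summing child counts is only valid because the toy graph is a multitree:
    no sample is reachable from a node by two different paths.
    """
    counts = {}

    def visit(node):
        if node not in counts:  # reuse the value if it was already computed
            kids = children[node]
            counts[node] = 1 if not kids else sum(visit(k) for k in kids)
        return counts[node]

    for node in children:
        visit(node)
    return counts


def allele_frequencies(children, mutation_node, num_samples):
    """Derive each mutation's frequency from the shared per-node counts."""
    counts = samples_beneath(children)
    return {m: counts[n] / num_samples for m, n in mutation_node.items()}


def carriers(children, node):
    """Expand a node into its explicit sample set (the 'tabular' view)."""
    kids = children[node]
    if not kids:
        return {node}
    return set().union(*(carriers(children, k) for k in kids))


if __name__ == "__main__":
    print(allele_frequencies(CHILDREN, MUTATION_NODE, NUM_SAMPLES))
```

In this toy graph the count for node 4 is computed once and reused by both `m2` and `m3`, whereas a row-by-row scan of a dense genotype table would touch the entries for samples 0 and 1 again for each variant. The recursive traversal is fine for a handful of nodes; a graph with millions of nodes would need an iterative, topologically ordered pass instead.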