STABIX: Summary statistic-based GWAS indexing and compression

Kristen Schneider, Simon Walker, Christopher Gignoux, Ryan M Layer
DOI: https://doi.org/10.1101/2024.11.15.623812
2024-11-15
Abstract: Genome-Wide Association Studies (GWAS) are widely used to investigate the role of genetics in disease traits, but the files these studies produce are large, posing barriers to efficient storage, sharing, and querying. The issue is especially acute for biobanks such as the UK Biobank, which publish GWAS for thousands of traits and so multiply the volume of data that must be managed. Current compression and query methods reduce file sizes and allow quick queries by genomic position, but they offer no fast way to find loci based on their summary statistics. For example, finding all SNVs in a particular p-value range requires decompressing and scanning the whole file. We propose a new tool, STABIX, which introduces summary-statistic-based queries and improves upon the standard bgzip compression and tabix query tools in both compression ratio and decompression speed. When applied to ten GWAS files from PanUKBB, STABIX produced smaller compressed data and indices than tabix for every file: the bgzip and tbi files were, on average, 1.2 times the size of the STABIX compressed files and indexes. On the same ten files, STABIX per-gene decompression was, on average, 7x faster than tabix per-gene decompression, and it was faster for over 99% of nearly 20,000 genes.
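To make the idea of a summary-statistic-based query concrete, the following is a minimal sketch of one generic way such a query can avoid a full scan: compress records in blocks and keep each block's p-value range in a small index, so that only overlapping blocks are decompressed. This is an illustration under stated assumptions, not a description of STABIX's actual format or algorithm; the function names, the block size, the use of zlib, and the toy records are all hypothetical.

```python
import zlib

def build_blocks(rows, block_size=2):
    """Compress (snp_id, p_value) rows in fixed-size blocks.

    Hypothetical helper: each entry stores the block's min and max
    p-value next to its compressed payload, forming a tiny index.
    """
    blocks = []
    for i in range(0, len(rows), block_size):
        chunk = rows[i:i + block_size]
        pvals = [p for _, p in chunk]
        payload = "\n".join(f"{snp}\t{p!r}" for snp, p in chunk).encode()
        blocks.append((min(pvals), max(pvals), zlib.compress(payload)))
    return blocks

def query_pvalue_range(blocks, lo, hi):
    """Return SNP ids with lo <= p <= hi and the number of blocks decompressed."""
    hits = []
    decompressed = 0
    for block_min, block_max, data in blocks:
        # Skip blocks whose p-value range cannot overlap the query range.
        if block_max < lo or block_min > hi:
            continue
        decompressed += 1
        for line in zlib.decompress(data).decode().split("\n"):
            snp, p = line.split("\t")
            if lo <= float(p) <= hi:
                hits.append(snp)
    return hits, decompressed

# Toy records (made up for illustration only).
rows = [("rs1", 0.5), ("rs2", 0.9), ("rs3", 1e-9),
        ("rs4", 3e-8), ("rs5", 0.2), ("rs6", 0.04)]
blocks = build_blocks(rows)
hits, n_decompressed = query_pvalue_range(blocks, 0.0, 5e-8)
```

In this toy example only one of the three blocks overlaps the queried range, so the other two are never decompressed. How well such an index prunes in practice depends on how the statistic is distributed across blocks.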
Bioinformatics