Abstract: Background: With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both a high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives. Findings: To address this issue, we introduce GSC (Genotype Sparse Compression), a specialized lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showed strong performance in genotype data compression. Compared with the industry's most advanced tools (namely, GBC and GTC), GSC achieved compression ratios that were 26.9% to 82.4% higher than those of GBC and GTC, respectively, on the datasets. In lossless compression scenarios, a mode not supported by either GBC or GTC, GSC also demonstrated robust performance, with compression ratios 1.5× to 6.5× greater than those of general-purpose tools such as gzip, zstd, and BCFtools. Achieving such high compression ratios required some trade-offs, including longer decompression times: GSC was 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds equivalent to those of its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced technologies. Conclusion: GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.

A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression

Gene Sequence Alignment on a Public Computing Platform

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method

Pareto Optimal Compression of Genomic Dictionaries, with or without Random Access in Main Memory

Reference-based genome compression using the longest matched substrings with parallelization consideration

A new efficient referential genome compression technique for FastQ files

A Novel Compression Algorithm for High-Throughput DNA Sequence Based on Huffman Coding Method

GVC: efficient random access compression for gene sequence variations

GSC: efficient lossless compression of VCF files with fast query

Genome Compression Against a Reference

AMGC: Adaptive match-based genomic compression algorithm

Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Investigating Memory Optimization of Hash-Index for Next Generation Sequencing on Multi-Core Architecture

Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression

Genbit Compress Tool (GBC): A Java-Based Tool to Compress DNA Sequences and Compute Compression Ratio (bits/base) of Genomes

A Compressed Self-Index for Genomic Databases

Sequence Compression Benchmark (SCB) database: A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

Accelerating K-mer Frequency Counting with GPU and Non-Volatile Memory

Generalized compression and compressive search of large datasets

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism