vcfpp: a C++ API for rapid processing of the Variant Call Format

Zilong Li
DOI: https://doi.org/10.1101/2023.10.12.555914
2024-01-16
Abstract: Given the widespread use of the Variant Call Format (VCF/BCF) and the continuing surge in big data, there is a persistent demand for fast and flexible methods to manipulate these formats across programming languages. Many bioinformatics tools are developed in C++ to ensure high performance, and modern C++ standards offer an ever-expanding set of libraries that ease program development. This work presents vcfpp, a single-file C++ API for HTSlib that provides an intuitive interface for manipulating VCF/BCF files rapidly and safely while remaining portable. This work also introduces the vcfppR package to demonstrate how a high-performance R package can be developed with vcfpp, allowing rapid and straightforward variant analyses. In the benchmark, with a compressed VCF of 3202 samples and one million variants as input, the dynamic script using vcfppR is only 1.3× slower than its compiled C++ counterpart vcfpp, whereas the Python API cyvcf2 is 1.9× slower, when streaming a variant analysis with little memory. Lastly, in a two-step setting where the whole VCF content is loaded first, vcfppR demonstrates a 101× speed improvement over vcfR, and an even larger one over data.table, in processing genotypes.
Bioinformatics