RabbitQCPlus: More Efficient Quality Control for Sequencing Data.

Lifeng Yan, Zekun Yin, Hao Zhang, Zhan Zhao, Mingkai Wang, André Müller, Robin Kobus, Yanjie Wei, Beifang Niu, Bertil Schmidt, Weiguo Liu
DOI: https://doi.org/10.1109/bibm55620.2022.9995332
2022-01-01
Abstract: Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster than state-of-the-art applications when performing basic quality control operations, while requiring fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files. Furthermore, on a 48-core server with per-read over-representation analysis enabled, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.
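To make "basic quality control operations" concrete, the following is a minimal C++ sketch of per-read statistics of the kind such tools typically collect (read length, GC content, N count, mean base quality). It is an illustration only, not code from RabbitQCPlus: the names ReadStats and compute_read_stats are hypothetical, and Phred+33 quality encoding is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical per-read summary; not a RabbitQCPlus type.
struct ReadStats {
    std::size_t length = 0;     // number of bases in the read
    std::size_t gc_count = 0;   // number of G or C bases
    std::size_t n_count = 0;    // number of ambiguous N bases
    double mean_quality = 0.0;  // mean Phred score (Phred+33 assumed)
};

// Compute basic statistics for one FASTQ record.
// Assumption: seq and qual have equal length and qual is Phred+33 encoded.
inline ReadStats compute_read_stats(const std::string& seq,
                                    const std::string& qual) {
    assert(seq.size() == qual.size());

    ReadStats stats;
    stats.length = seq.size();

    // Simple counting loop over the bases, with no early exits.
    for (char base : seq) {
        stats.gc_count += (base == 'G' || base == 'C' ||
                           base == 'g' || base == 'c') ? 1 : 0;
        stats.n_count += (base == 'N' || base == 'n') ? 1 : 0;
    }

    // Sum the decoded quality scores, then average.
    std::size_t quality_sum = 0;
    for (char q : qual) {
        quality_sum += static_cast<std::size_t>(q - 33);
    }
    if (!qual.empty()) {
        stats.mean_quality =
            static_cast<double>(quality_sum) / static_cast<double>(qual.size());
    }
    return stats;
}
```

For example, calling compute_read_stats("GATTACAN", "IIIIIIII") yields a length of 8, a GC count of 2, an N count of 1, and a mean quality of 40. Plain counting loops like these are the sort of per-base work that vectorization can speed up, although how RabbitQCPlus actually implements it is not described in the abstract.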