Mining Statistically-Solid K-Mers for Accurate NGS Error Correction

Liang Zhao, Jin Xie, Lin Bai, Wen Chen, Mingju Wang, Zhonglei Zhang, Yiqi Wang, Zhe Zhao, Jinyan Li
DOI: https://doi.org/10.1186/s12864-018-5272-y
IF: 4.547
2018-01-01
BMC Genomics
Abstract: Background: NGS data contain many machine-induced errors. The most advanced error-correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely contains errors. An intensively investigated problem is to find a good frequency cutoff f0 that balances the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance. Results: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model correct k-mers, and to combine them to determine f0. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Our method is markedly superior to state-of-the-art methods when tested on both real and synthetic NGS data sets. Conclusion: The z-score is adequate to distinguish solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error-correction accuracy.
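The cutoff-then-z-score idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say which mean and standard deviation the z-score is taken against, so the sketch assumes the global mean and population standard deviation of all observed k-mer frequencies. The function names (`count_kmers`, `split_by_cutoff`, `z_scores`) are hypothetical, the cutoff `f0` is passed in by hand rather than fitted from a Gamma/Gaussian mixture, and no Bloom filter is built.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, f0):
    """Split k-mers into solid (frequency >= f0) and weak (frequency < f0)."""
    solid = {kmer for kmer, freq in counts.items() if freq >= f0}
    weak = set(counts) - solid
    return solid, weak


def z_scores(counts):
    """Z-score of each k-mer's frequency.

    Assumption: the reference mean and standard deviation are computed
    over all k-mer frequencies; the paper may use a different reference.
    """
    freqs = list(counts.values())
    mu = mean(freqs)
    sigma = pstdev(freqs)
    if sigma == 0:
        # All frequencies are identical, so no k-mer deviates from the mean.
        return {kmer: 0.0 for kmer in counts}
    return {kmer: (freq - mu) / sigma for kmer, freq in counts.items()}


# Toy reads for illustration only.
reads = ["ACGTACGT", "ACGTACGA", "ACGT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f0=2)
scores = z_scores(counts)
```

In this toy input, `ACGA` appears once and falls below the cutoff, while `ACGT` appears four times and gets the highest z-score. A fuller pipeline would use the z-scores to move borderline k-mers between the two sets before building the filter, as the abstract describes.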