A Crowdsourcing Method For Correcting Sequencing Errors For The Third-Generation Sequencing Data

Yu Geng,Zhongmeng Zhao,Zhaofang Du,Yixuan Wang,Tian Zheng,Siyu He,Xuanping Zhang,Jiayin Wang
DOI: https://doi.org/10.1109/BIBM.2017.8217903
2017-01-01
Abstract:The third generation sequencing data exposes great advantage on read length, which extremely benefits the genomic analyses. However, the third generation sequencing data implies error models different from the ones that the second generation data brings. It is suggested to correct sequencing errors, which could significantly reduce false positives in downstream analyses. Existing error correction approaches often suffer accuracy loss when the hybrid reads present diversity or the coverage varies. In this paper, we propose a novel method based on crowdsourcing strategy, which is implemented as CLTC. CLTC is also a hybrid correction algorithm, which consists of four steps. The second generation reads are first collected and mapped to the third generation reads. Then, the base difficult level is defined to describe the diversities on a base among a group of 2nd-generation reads covered it. The capability is evaluated for each 2nd-generation read, which considers the base difficult levels across the read, the consistency among overlapped reads and the mapping quality between the 2nd- and 3rd-generation reads. A heuristic algorithm is designed for the calculation of capabilities. An expectation-maximization algorithm is finally used to compute the corrected result for each base-pair. We test CLTC on different datasets and compare to the existing approaches. The results demonstrate that CLTC is able to achieve higher accuracy and performs faster than the existing ones.
What problem does this paper attempt to address?