MEC: Misassembly Error Correction in Contigs Based on Distribution of Paired-End Reads and Statistics of GC-contents
Binbin Wu, Min Li, Xingyu Liao, Junwei Luo, Fang-Xiang Wu, Yi Pan, Jianxin Wang
DOI: https://doi.org/10.1109/tcbb.2018.2876855
2020-01-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: De novo assembly tools aim to reconstruct genomes from next-generation sequencing (NGS) data. However, these tools usually generate a large number of contigs containing many misassemblies, which are caused by repetitive regions, chimeric reads, and sequencing errors. Detecting and correcting misassemblies in contigs is appealing because it improves the accuracy of assembly results, yet it remains challenging. In this study, a novel method, called MEC, is proposed to identify and correct misassemblies in contigs. Based on the insert size distribution of paired-end reads and a statistical analysis of GC-contents, MEC can accurately identify more misassemblies. We evaluate MEC with the NA50 and NGA50 metrics on four datasets, compare it with most of the available misassembly correction tools, and carry out experiments to analyze the influence of MEC on scaffolding results. The results show that MEC can reduce misassemblies effectively and yield quantitative improvements in scaffolding quality. MEC is publicly available at https://github.com/bioinfomaticsCSU/MEC.
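The two signals named in the abstract, insert sizes of paired-end reads and GC-content, can be illustrated with a minimal sketch. This is not MEC's actual algorithm, which the abstract does not specify. The function names, the window size, and the cutoff `k` of three standard deviations are all assumptions made for illustration only.

```python
from statistics import mean, stdev


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window):
    """GC-content of consecutive non-overlapping windows of a contig.

    A trailing partial window is ignored.
    """
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


def discordant_pairs(insert_sizes, k=3.0):
    """Indices of read pairs whose insert size lies more than k sample
    standard deviations from the mean (hypothetical cutoff, not MEC's).
    """
    mu = mean(insert_sizes)
    sigma = stdev(insert_sizes)
    return [i for i, s in enumerate(insert_sizes) if abs(s - mu) > k * sigma]


if __name__ == "__main__":
    # Toy data: twenty well-behaved pairs and one suspiciously long insert.
    sizes = [300] * 20 + [5000]
    print(discordant_pairs(sizes))
    print(windowed_gc("ATGCGGCC", 4))
```

A cluster of discordant pairs or an abrupt shift in windowed GC-content at the same contig position would be a candidate misassembly breakpoint under these assumptions. In practice, the mean and standard deviation are sensitive to the outliers being searched for, so a robust estimate such as the median may be preferable.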