Detecting Complex Indels With Wide Length-Spectrum From The Third Generation Sequencing Data

Xuanping Zhang, Hengwei Chen, Rong Zhang, Jingwen Pei, Yixuan Wang, Zhongmeng Zhao, Yi Huang, Jiayin Wang
DOI: https://doi.org/10.1109/BIBM.2017.8217965
2017-01-01
Abstract: Structural variations are a complex collection of mutations, many of which are reported to be associated with complex traits. Recent research reports a rare type of structural variant, the complex indel, which may contribute to carcinogenesis. A complex indel often presents multiple inserted nucleotides in a deleted region. Owing to limitations in both data and algorithms, existing approaches can only detect complex indels shorter than 80 bp; however, longer ones are considered to have a stronger impact. In this paper, we propose a novel algorithm, SVseq3, which handles PacBio data and identifies long complex indels. The algorithm takes the BLASR alignment results and locates suspicious areas of complex indels by clustering. An improved similarity hash-based framework is then constructed. For each suspicious area, a continuing-seed strategy is adopted to split the inserted fragments and recover their original locations. The mapped segments, each consisting of a series of seeds, are used to further narrow down the intermediate breakpoints and identify the forms of the complex indels. SVseq3 is able to detect long complex indels as well as complex indels whose inserted fragments come from multiple sources. We test SVseq3 on multiple datasets with different simulation configurations and compare it to existing methods. The experimental results demonstrate that SVseq3 outperforms the existing approaches. In some common simulation settings, the sensitivity and positive predictive rate reach around 70% and 85%, respectively.
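The abstract says that suspicious areas are located by clustering signals from the BLASR alignments, but it does not specify the clustering rule. As a rough illustration only, the sketch below groups candidate breakpoint coordinates that lie close together on the reference. The function name `cluster_positions` and the parameters `max_gap` and `min_support` are assumptions made for this example and are not taken from SVseq3.

```python
from typing import List, Tuple


def cluster_positions(
    positions: List[int],
    max_gap: int = 100,      # assumed: largest distance between neighbours in one cluster
    min_support: int = 2,    # assumed: minimum number of signals to keep a cluster
) -> List[Tuple[int, int, int]]:
    """Group reference coordinates into (start, end, support) regions.

    Hypothetical gap-based clustering, not the rule used by SVseq3.
    """
    regions: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_support:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the last open cluster.
    if len(current) >= min_support:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Made-up coordinates standing in for breakpoint signals from read alignments.
signals = [1000, 1020, 1050, 5000, 9000, 9040]
print(cluster_positions(signals))
```

With these made-up coordinates, the isolated signal at 5000 is dropped because it lacks support, and two candidate regions remain for the later seed-based analysis.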