An Accurate Sequence Assembly Algorithm for Livestock, Plants and Microorganism Based on Spark

Gaifang Dong, Xueliang Fu, Honghui Li, Xu Pan
DOI: https://doi.org/10.1142/s0218001417500240
IF: 1.261
2017-01-01
International Journal of Pattern Recognition and Artificial Intelligence
Abstract: Sequence assembly is an important topic in bioinformatics research. Sequence assembly algorithms have long suffered from poor assembly precision and low efficiency. To address these two problems, this paper designs and implements a precise assembly algorithm that follows a strategy of finding the source of reads, based on MapReduce and the Eulerian path algorithm (SA-BR-MR). Computational results show that SA-BR-MR is more accurate than other algorithms. SA-BR-MR is also applied to 54 sequences randomly selected from NCBI, drawn from animals, plants and microorganisms, with lengths ranging from hundreds to tens of thousands of bases. The matching rate is 100% for all 54 sequences. For each species, the paper summarizes the range of [Formula: see text] for which the matching rate is 100%. To verify the range of [Formula: see text] for hepatitis C virus (HCV) and related variants, eight randomly selected HCV variants are assembled. The results confirm the [Formula: see text] range for HCV and related variants from NCBI, and provide a basis for sequencing other HCV variants. In addition, Spark is a newer computing platform based on in-memory computation; it is highly efficient and well suited to iterative calculation. This paper therefore also designs and implements a sequence assembly algorithm on the Spark platform under the same strategy of finding the source of reads (SA-BR-Spark). Compared with SA-BR-MR, SA-BR-Spark runs faster.
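The abstract names the Eulerian path algorithm as one building block but gives no details. As a rough illustration only, the following is a minimal single-machine sketch of textbook Eulerian-path assembly over a de Bruijn graph. It is not the authors' SA-BR-MR or SA-BR-Spark method: it does not implement the read-source strategy or any MapReduce/Spark distribution. The function names, the k-mer length parameter `k`, and the toy reads are all assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # keep each distinct k-mer once
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Return the node sequence of an Eulerian path (Hierholzer's algorithm).

    Assumes such a path exists; no validation is performed.
    """
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any;
    # otherwise the graph has an Eulerian cycle and any node with edges works.
    start = next(
        (u for u, targets in graph.items() if len(targets) - in_deg[u] == 1),
        next(iter(graph)),
    )
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(build_de_bruijn(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    # Toy reads covering the made-up sequence ATGGCGTGCA (not data from the paper).
    toy_reads = ["ATGGCG", "TGGCGT", "GGCGTG", "GCGTGC", "CGTGCA"]
    print(assemble(toy_reads, k=4))
```

In this toy case every (k-1)-mer occurs once, so the Eulerian path is unique. With repeats, several Eulerian paths can exist and the reconstruction becomes ambiguous, which is one reason the choice of k-mer length matters in assemblers of this kind.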