Too many needles in this haystack: algorithms for the analysis of next generation sequence data

Ting Chen, Mourad Tade Souaiaia
2012-01-01
Abstract: The development of second-generation sequencing (SGS) technology has provided scientists with a myriad of opportunities as well as new challenges. SGS machines can sequence billions of short reads at a fraction of the cost and time of older technology. Often, the study of sequence data begins with the alignment of billions of short DNA reads to the 3-billion-base-pair human reference genome, a daunting computational task, especially if the error rate between the reads and the reference is high. For this reason, PerM was developed; it uses periodic spaced seeds to provide efficient, accurate, and highly sensitive ungapped alignment for Illumina and SOLiD reads. Inexact alignments are often the most interesting biologically, because mismatches between the read and the reference are often the result of genetic variation. To detect variation and distinguish it from machine errors, we developed ComB, which iteratively applies Bayesian statistics to color or base alignments to determine mutation probability. This allowed us to study a host of biological phenomena that result in rare nucleotide differences, including single nucleotide polymorphisms (SNPs), RNA editing, and allele-specific expression. DNA methylation of cytosine residues also produces single-base mismatches when DNA is treated with sodium bisulfite, which converts all unmethylated cytosine residues to thymine. To accurately estimate methylation rates from sodium bisulfite-treated DNA, we developed FadE, an algorithm that uses Newton-Raphson optimization to estimate the methylation rate at every cytosine residue in the genome. Finally, we have applied all our statistical tools to study human mRNA editing, and have shown that RNA editing in human brain tissue occurs at a much lower rate than previously thought.
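To make the Newton-Raphson step concrete, the following is a minimal sketch of estimating a methylation rate at a single cytosine from bisulfite read counts. It is not FadE's actual model, which the abstract does not specify. The function name `estimate_methylation_rate`, the symmetric `error_rate` parameter, and the simple binomial likelihood are all assumptions made for illustration: a read shows C with probability p = e + m(1 - 2e), where m is the methylation rate and e is the chance that a base is read as the wrong one of C and T.

```python
def estimate_methylation_rate(n_c, n_t, error_rate=0.01, tol=1e-10, max_iter=50):
    """Maximum-likelihood methylation rate at one cytosine (illustrative model).

    n_c: reads showing C (taken as methylated) at the site
    n_t: reads showing T (taken as unmethylated, converted) at the site
    error_rate: assumed symmetric C<->T miscall probability, 0 < e < 0.5
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    if n_c + n_t == 0:
        raise ValueError("at least one read is required")

    scale = 1.0 - 2.0 * error_rate
    m = 0.5  # start in the middle of the valid range
    for _ in range(max_iter):
        p = error_rate + m * scale  # probability that a read shows C
        # First and second derivatives of the binomial log-likelihood in m.
        grad = scale * (n_c / p - n_t / (1.0 - p))
        hess = -scale ** 2 * (n_c / p ** 2 + n_t / (1.0 - p) ** 2)
        # Newton-Raphson update, clamped so m stays a valid rate.
        new_m = min(1.0, max(0.0, m - grad / hess))
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m


if __name__ == "__main__":
    # Hypothetical counts: 15 reads show C and 5 show T at one site.
    print(estimate_methylation_rate(15, 5))
```

Under this simplified model the estimate also has a closed form, (n_c / n - e) / (1 - 2e) clamped to [0, 1], so the iteration serves mainly to show the update rule. A likelihood with more parameters would generally need the iterative solution.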