Computational Genome Analysis

Steven S. Skiena, Ting Chen
1997-01-01
Abstract: This thesis is concerned with computational approaches to genome analysis. We discuss three biological applications: genomic rearrangements, gene recognition, and genome sequencing, all of whose practical solutions involve interesting algorithmic problems. In genomic rearrangements, we seek to reconstruct the evolutionary history of the genome. We study the distance between genomes under fixed-length inversions and give a complete theoretical characterization for both linear and circular genomes. We also prove upper and lower bounds on the minimum distance. Pattern recognition is central to many gene recognition systems. We apply linear discriminant analysis in a special-purpose program called Pombe to identify protein-coding regions in the Schizosaccharomyces pombe genome. Under cross-validation, the predicted gene structures achieve a correlation coefficient of 97.2% at the nucleotide level. In a large-scale genome sequencing project, we show that data structures are powerful tools in many pattern-matching applications. We introduce a heuristic to speed up fragment assembly and implement it using a data structure called a suffix array, which speeds up overlap detection by up to 1,000 times while maintaining high accuracy. Finally, we report recent progress on this sequencing project and on the assembly program STROLL. Compared with other widely used assemblers, STROLL is significantly faster and more reliable in handling repeat regions. In the last chapter, we point to open problems for future research that are of great interest to both computer scientists and biologists.