Large Scale Data Analysis for Computational Biochemistry

Xuemei Luo
DOI: https://doi.org/10.22215/etd/2011-09639
2011-01-01
Abstract: In this thesis, computational methods were developed to solve three biological problems: designing RNA/DNA pools for in vitro selection of complex aptamers; identifying transposon insertion polymorphisms by computational comparative analysis of next generation personal genome data; and identifying and predicting protein complexes from protein-protein interaction networks. It is well known that using random RNA/DNA sequences as starting pools in in vitro aptamer selection experiments generally yields low-complexity structures. Two computational methods were developed to generate sequence pools that exhibit higher structural complexity and can be used to increase the structural diversity of initial pools. Random Filtering increases the number of five-way junctions in RNA/DNA pools, and Genetic Filtering designs RNA/DNA pools with a specified structural distribution. DNA pools designed by these methods were shown to greatly improve access to highly complex sequence structures for aptamer selections. Structural variations in a genome are a prominent and important type of genetic variation. Among all types of structural variation, transposon insertion polymorphisms are particularly challenging to identify because transposon sequences are highly repetitive. A computational method, TIP-finder, was developed to identify transposon insertions through the analysis of next generation personal genome data. The performance of TIP-finder was tested with simulated data, where it detected 88% of transposon insertions with a precision of at least 91%. Using TIP-finder to analyze six genomes, a total of 5569 transposon insertions were identified, representing the most comprehensive analysis of this type of genetic variation. The cluster editing problem is NP-hard (non-deterministic polynomial-time hard). In this thesis, a method based on fixed-parameter tractability was implemented and improved for cluster editing. This method was then applied to identify and predict protein complexes in yeast. Results showed that the method has the potential to identify known protein complexes, to predict new member proteins of those complexes, and to predict new edges in protein-protein interaction networks.
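To make the fixed-parameter idea behind cluster editing concrete, the following is a minimal sketch of the textbook bounded search tree, not the improved method implemented in the thesis. It assumes an undirected graph given as a set of `frozenset` edges; the function names `find_conflict_triple` and `cluster_edit` and the toy node labels are hypothetical and chosen only for illustration. A graph is a disjoint union of cliques exactly when it contains no "conflict triple" (u adjacent to both v and w, while v and w are not adjacent), so the search branches on the three ways to repair one such triple and spends one unit of the edit budget k per branch.

```python
from itertools import combinations


def find_conflict_triple(nodes, edges):
    """Return (u, v, w) where u-v and u-w are edges but v-w is not, else None."""
    for u in nodes:
        neighbours = sorted(x for x in nodes if frozenset((u, x)) in edges)
        for v, w in combinations(neighbours, 2):
            if frozenset((v, w)) not in edges:
                return u, v, w
    return None


def cluster_edit(nodes, edges, k):
    """Return an edge set that is a disjoint union of cliques and is
    reachable with at most k edge insertions/deletions, or None."""
    triple = find_conflict_triple(nodes, edges)
    if triple is None:
        return edges          # already a cluster graph
    if k == 0:
        return None           # a conflict remains but the budget is spent
    u, v, w = triple
    uv, uw, vw = frozenset((u, v)), frozenset((u, w)), frozenset((v, w))
    # Any solution must change at least one of these three pairs.
    for candidate in (edges - {uv}, edges - {uw}, edges | {vw}):
        result = cluster_edit(nodes, candidate, k - 1)
        if result is not None:
            return result
    return None


# Toy network (hypothetical labels): a triangle A-B-C plus one extra edge C-D.
nodes = ["A", "B", "C", "D"]
edges = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]}
result = cluster_edit(nodes, edges, k=1)
print(sorted(tuple(sorted(e)) for e in result))
```

The search tree has at most 3^k leaves, so the running time is exponential only in the edit budget k rather than in the size of the network. In a protein-protein interaction setting, the cliques of the edited graph would be read as candidate complexes, and inserted edges as candidate interactions.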