Abstract: Understanding the sequential information coded in DNA, RNA and proteins is important for both basic and applied research in the life sciences. Extensive efforts have been devoted to the research and development of DNA sequence analysis methods. The studies described in this dissertation explored new applications of existing methods in the context of recently developed ultra-high-throughput sequencing technologies. This dissertation also included new methods developed for studying gene families and human haplogroups. The theories, algorithms and tools for analyzing DNA sequence information relevant to these studies are reviewed in Chapter 1.

Recent developments in DNA sequencing technologies have brought many new research opportunities. They have also brought great challenges, mainly because of the large volume of data produced by the latest high-throughput sequencing technologies. The potential of these new technologies was exploited to complete a 100,000-year-old ancient polar bear mitochondrial genome. With this and some additional modern bear data, the divergence time of the polar bear matriline was estimated to be around 130,000 years ago, which is significantly older than some recent estimates. This estimate indicated that the matrilineal ancestors of modern polar bears adapted to the niche polar environment within 30,000 years after the speciation event and propagated along the entire Arctic Circle for the next 100,000 years. This recent speciation and rapid expansion process is analogous to the evolution and migration of modern humans. The lineage characteristics of the latter were also briefly studied using the same technologies. (Chapter 2)

Because of the increased efficiency of the latest sequencing technologies, complete human mitochondrial genomes are being generated at an ever faster rate. Although mitochondrial haplogroups, and their classification and identification, were widely used in human evolution and population studies, the existing tools could not take full advantage of the rapidly growing number of new mitochondrial genomes. An updated mitochondrial haplogroup classification system was thus developed with evolutionary models that incorporate the mitochondrial genomic variations within the human population. These variations were not considered by previous methods, which could lead to incorrectly classified haplogroups. The variation parameters, including the whole-genome substitution rate (0.013 to 0.1 substitutions per generation), the rate heterogeneity among sites (Gamma distribution shape parameter α = 0.7078) and the percentage of invariant sites (64%), were estimated from 7985 full-length human mitochondrial genome sequences. Haplogroups were then classified based on the corrected genetic distance estimates and modeled with position-specific matrices. A new haplogroup identification system was developed based on the resulting matrices and the maximum-likelihood estimation (MLE) method, permitting fast and accurate haplogroup assignment for both known and new mitochondrial genomes. The entire system is available through the HapSearch web application (http://hapsearch.synblex.com). (Chapter 3)

The latest sequencing technologies also allowed a more thorough study of stage-specific transcriptional activities. To elucidate the transcriptomic profiles and new transcriptomic activities in neural development, nine recent RNA-seq datasets corresponding to tissues and organs ranging from stem cells and embryonic brain cortex to adult whole brain were analyzed. Global similarities between the neural and stem cell transcriptomes were found at both the genic and chromosomal levels. A previously undocumented high level of unannotated expression was found in mouse embryonic brain cortices, the intronic part of which was found to be strongly associated with gene ontology (GO) categories that are important for synaptogenesis and neural circuit formation. This suggested potentially novel genes, gene functions and regulatory mechanisms in early brain development. (Chapter 4)

Although the speed of generating genomic sequences was increasing rapidly, the development of genome annotation was lagging behind. This slowed down or prevented broader use of the newly sequenced genomes. To partially mitigate this situation, a new tool, called Phoenix, was developed for retrieving homologues of a given gene or gene family from unannotated genomes. Phoenix was fast and accurate in simulations using data from known gene families. Its advantage was further demonstrated by correctly retrieving homologues of a gene family that has a known complex evolutionary history. This tool allows gene family studies in unannotated or even partially assembled genomes. (Chapter 5)

Finally, this dissertation concluded with a discussion of the intrinsic limitations and advantages of DNA sequence analysis, along with its current and future potential applications. (Chapter 6)
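
The haplogroup identification step summarized for Chapter 3 (position-specific matrices combined with maximum-likelihood assignment) can be illustrated with a minimal sketch. This is not the HapSearch implementation. The function names, the pseudocount, and the six-base toy sequences are hypothetical and chosen only for illustration. The sketch assumes pre-aligned sequences of equal length, and it does not model the rate heterogeneity or invariant sites described in the abstract.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_position_matrix(sequences, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sequences.

    The pseudocount (an assumption of this sketch) keeps unseen bases from
    getting probability zero.
    """
    length = len(sequences[0])
    matrix = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def log_likelihood(query, matrix):
    """Sum of log-probabilities of the query's bases; non-ACGT symbols are skipped."""
    return sum(math.log(col[base]) for base, col in zip(query, matrix) if base in col)


def assign_haplogroup(query, matrices):
    """Return the label whose matrix gives the query the highest log-likelihood."""
    return max(matrices, key=lambda name: log_likelihood(query, matrices[name]))


# Toy training data: the labels and sequences are illustrative only.
training = {
    "H": ["ACGTAC", "ACGTAC", "ACGTAT"],
    "L": ["ATGCAC", "ATGCAC", "ATGCAT"],
}
matrices = {name: build_position_matrix(seqs) for name, seqs in training.items()}

print(assign_haplogroup("ACGTAC", matrices))
```

In this toy setting each group gets one matrix, and a query is scored against every matrix independently, so adding a new group only requires building one more matrix.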

Large Scale Data Analysis for Computational Biochemistry

Computational Genome Analysis

Analysis of protein-protein interactions using multiple biological data sets

Using Iterative Cluster Merging with Improved Gap Statistics to Perform Online Phenotype Discovery in the Context of High-Throughput RNAi Screens

Probabilistic analysis of the human transcriptome with side information

Too many needles in this haystack: algorithms for the analysis of next generation sequence data

High-throughput protein analysis integrating bioinformatics and experimental assays

DNA sequence analysis: new applications with high throughput sequencing and new methods in studying gene families and human haplogroups

Computational analysis of epitope-specific T-cell repertoires.

Geometric combinatorics and computational molecular biology: branching polytopes for RNA sequences

Large-scale data analysis for robotic yeast one-hybrid platforms and multi-disciplinary studies using GateMultiplex

RNAprofiling 2.0: Enhanced cluster analysis of structural ensembles

Efficient Algorithms in Analyzing Genomic Data

Simplified, open‐source analysis of DNA‐binding proteins

Accelerated simulations of RNA clustering: a systematic study of repeat sequences

A Computational Pipeline for the Extraction of Actionable Biological Information From NGS-Phage Display Experiments

Analyzing large-scale DNA Sequences on Multi-core Architectures

Computational identification of protein binding sites on RNAs using high-throughput RNA structure-probing data.

Computational Analysis of RNA-Protein Interactions via Deep Sequencing.

Understanding large scale sequencing datasets through changes to protein folding

Molecular dynamics simulations and analysis for bioinformatics undergraduate students