Abstract: Understanding the sequence information encoded in DNA, RNA and proteins is important for both basic and applied research in the life sciences. Extensive efforts have been devoted to the research and development of DNA sequence analysis methods. The studies described in this dissertation explored new applications of existing methods in the context of the recent development of ultra-high throughput sequencing technologies. This dissertation also included new methods developed for studying gene families and human haplogroups. The theories, algorithms and tools for analyzing DNA sequence information relevant to these studies are reviewed in Chapter 1 of this dissertation.

Recent developments in DNA sequencing technologies brought many new research opportunities. They also brought great challenges, mainly because of the large volume of data produced by the latest high throughput sequencing technologies. The potential of these new technologies was exploited to complete a 100,000-year-old ancient polar bear mitochondrial genome. With this and some additional modern bear data, the divergence time of the polar bear matrilineal lineage was estimated to be around 130,000 years ago, which is significantly older than some recent estimates. This estimate indicated that the matrilineal ancestors of modern polar bears adapted to the niche polar environment within 30,000 years after the speciation event and propagated along the entire Arctic Circle for the next 100,000 years. This recent speciation and rapid expansion process is analogous to the evolution and migration of modern humans. The lineage characteristics of the latter were also briefly studied using the same technologies. (Chapter 2)

Because of the increased efficiency of the latest sequencing technologies, complete human mitochondrial genomes have been generated at an ever faster pace. Although mitochondrial haplogroups, and their classification and identification, were widely used in human evolution and population studies, the current tools could not take full advantage of the rapidly growing number of new mitochondrial genomes. An updated mitochondrial haplogroup classification system was thus developed with evolutionary models that incorporate the mitochondrial genomic variations within the human population. These variations had not been considered by previous methods, an omission that could lead to incorrectly classified haplogroups. The variation parameters, including the whole-genome substitution rate (0.013 - 0.1 substitutions per generation), the rate heterogeneity among sites (Gamma distribution shape parameter α = 0.7078) and the percentage of invariant sites (64%), were estimated based on 7985 full-length human mitochondrial genome sequences. Haplogroups were then classified based on the corrected genetic distance estimation and modeled with position-specific matrices. A new haplogroup identification system was developed based on the resulting matrices and the maximum-likelihood estimation (MLE) method, permitting fast and accurate haplogroup assignment for both known and new mitochondrial genomes. The entire system is available through the HapSearch web application (http://hapsearch.synblex.com). (Chapter 3)

The latest sequencing technologies also allowed a more thorough study of stage-specific transcriptional activities. To elucidate the transcriptomic profiles and new transcriptomic activities in neural development, nine recent RNA-seq datasets corresponding to tissues/organs ranging from stem cells and embryonic brain cortex to adult whole brain were analyzed. Global similarities between the neural and stem cell transcriptomes were found at both the genic and chromosomal levels. A previously undocumented high level of unannotated expression was found in mouse embryonic brain cortices, the intronic part of which was found to be strongly associated with gene ontology (GO) categories that are important for synaptogenesis and neural circuit formation. This suggested potentially novel genes, gene functions and regulatory mechanisms in early brain development. (Chapter 4)

Although the speed of generating genomic sequences was increasing rapidly, the development of genome annotation was lagging behind. This slowed down or prevented broader use of the newly sequenced genomes. To partially mitigate this situation, a new tool, called Phoenix, was developed for retrieving homologues of a given gene or gene family from unannotated genomes. Phoenix was fast and accurate in simulations using data from known gene families. Its advantage was further demonstrated by correctly retrieving homologues of a gene family that has a known complex evolutionary history. This tool allows gene family studies in unannotated genomes or even partially assembled genomes. (Chapter 5)

Finally, this dissertation concluded with a discussion of the intrinsic limitations and advantages of DNA sequence analysis, along with its current and future potential applications. (Chapter 6)
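The haplogroup identification step summarized for Chapter 3 (position-specific matrices combined with maximum-likelihood assignment) can be illustrated with a minimal sketch. This is not the HapSearch implementation: the toy aligned sequences, the haplogroup labels, the pseudocount, and the function names `build_psm`, `log_likelihood` and `assign_haplogroup` are all assumptions made for illustration, and the sketch ignores the substitution-rate, Gamma and invariant-site corrections described in the abstract.

```python
import math

ALPHABET = "ACGT"

# Hypothetical toy training data: pre-aligned, equal-length sequences
# grouped by an illustrative haplogroup label (not real mtDNA).
TRAINING = {
    "H": ["ACGTA", "ACGTA", "ACGTG"],
    "U": ["ATGCA", "ATGCA", "ATGCG"],
}

def build_psm(seqs, pseudocount=1.0):
    """Build a position-specific probability matrix from aligned sequences."""
    matrix = []
    for pos in range(len(seqs[0])):
        # Start from a pseudocount so unseen bases keep non-zero probability.
        counts = {base: pseudocount for base in ALPHABET}
        for seq in seqs:
            if seq[pos] in counts:  # ignore gaps and ambiguity codes
                counts[seq[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: c / total for base, c in counts.items()})
    return matrix

def log_likelihood(seq, matrix):
    """Sum per-position log probabilities of seq under one matrix."""
    return sum(
        math.log(column[base])
        for base, column in zip(seq, matrix)
        if base in column  # skip characters outside the alphabet
    )

def assign_haplogroup(seq, matrices):
    """Return the label whose matrix maximizes the likelihood of seq."""
    scores = {name: log_likelihood(seq, m) for name, m in matrices.items()}
    return max(scores, key=scores.get), scores

matrices = {name: build_psm(seqs) for name, seqs in TRAINING.items()}
best, scores = assign_haplogroup("ACGTA", matrices)
print(best, scores)
```

Scoring in log space avoids numerical underflow when the sequences are as long as a full mitochondrial genome, and the pseudocount keeps a single unseen variant from forcing a likelihood of zero.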

Recurrence Time Statistics: Versatile Tools for Genomic DNA Sequence Analysis

Deciphering the Structures of Genomic DNA Sequences Using Recurrence Time Statistics

Sequence Repetitiveness Quantification and De Novo Repeat Detection by Weighted K-Mer Coverage.

A method to predict clustered repeats in Salmonella genomes

A New Statistic for Efficient Detection of Repetitive Sequences

Identification of repeats in DNA sequences using nucleotide distribution uniformity

A Fast Exact Repeats Search Algorithm for Genome Analysis.

Comparison Of Various Algorithms For Recognizing Short Coding Sequences Of Human Genes

Protein coding sequence identification by simultaneously characterizing the periodic and random features of DNA sequences.

Integrated entropy-based approach for analyzing exons and introns in DNA sequences.

Repetitive DNA Sequence Detection and Its Role in the Human Genome

A novel exon finding algorithm based on the 3-base periodicity analysis of genome information

Building Innovative Representations of DNA Sequences to Facilitate Gene Finding.

Recognizing Shorter Coding Regions of Human Genes Based on the Statistics of Stop Codons.

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

DNA Origami-Enabled Gene Localization of Repetitive Sequences

SM-RCNV: a Statistical Method to Detect Recurrent Copy Number Variations in Sequenced Samples

Near-sigmoid Modeling to Simultaneously Profile Genome-wide DNA Replication Timing and Efficiency in Single DNA Replication Microarray Studies

DNA sequence analysis: new applications with high throughput sequencing and new methods in studying gene families and human haplogroups

Differential distribution of simple sequence repeats in eukaryotic genome sequences

Challenges in Detecting Somatic Recombination of Repeat Elements: Insights from Short and Long Read Datasets