Abstract: Understanding the sequential information encoded in DNA, RNA and proteins is important for both basic and applied research in the life sciences. Extensive effort has been devoted to the research and development of DNA sequence analysis methods. The studies described in this dissertation explored new applications of existing methods in the context of the recently developed ultra-high-throughput sequencing technologies. This dissertation also included new methods developed for studying gene families and human haplogroups. The theories, algorithms and tools for analyzing DNA sequence information relevant to these studies are reviewed in Chapter 1.

Recent developments in DNA sequencing technologies brought many new research opportunities. They also brought great challenges, mainly because of the large volume of data produced by the latest high-throughput sequencing technologies. The potential of these new technologies was exploited to complete the mitochondrial genome of a 100,000-year-old ancient polar bear. With this and some additional modern bear data, the divergence time of the polar bear matrilineal lineage was estimated to be around 130,000 years ago, which is significantly older than some recent estimates. This estimate indicated that the matrilineal ancestors of modern polar bears adapted to the niche polar environment within 30,000 years after the speciation event and propagated along the entire Arctic Circle over the next 100,000 years. This recent speciation and rapid expansion is analogous to the evolution and migration of modern humans. The lineage characteristics of the latter were also briefly studied using the same technologies. (Chapter 2)

Because of the increased efficiency of the latest sequencing technologies, complete human mitochondrial genomes have been generated at an ever faster pace. Although mitochondrial haplogroups, and their classification and identification, were widely used in human evolution and population studies, the current tools could not take full advantage of the rapidly growing number of new mitochondrial genomes. An updated mitochondrial haplogroup classification system was thus developed with evolutionary models that incorporate the mitochondrial genomic variations within the human population. Previous methods did not consider these variations, an omission that could lead to incorrectly classified haplogroups. The variation parameters, including the whole-genome substitution rate (0.013 to 0.1 substitutions per generation), the rate heterogeneity among sites (Gamma distribution shape parameter α = 0.7078) and the percentage of invariant sites (64%), were estimated from 7985 full-length human mitochondrial genome sequences. Haplogroups were then classified based on the corrected genetic distance estimates and modeled with position-specific matrices. A new haplogroup identification system was developed based on the resulting matrices and the maximum-likelihood estimation (MLE) method, permitting fast and accurate haplogroup assignment for both known and new mitochondrial genomes. The entire system is available through the HapSearch web application (http://hapsearch.synblex.com). (Chapter 3)

The latest sequencing technologies also allowed a more thorough study of stage-specific transcriptional activities. To elucidate the transcriptomic profiles and new transcriptomic activities in neural development, nine recent RNA-seq datasets were analyzed, corresponding to tissues and organs ranging from stem cells and embryonic brain cortex to adult whole brain. Global similarities between the neural and stem cell transcriptomes were found at both the genic and chromosomal levels. A previously undocumented high level of unannotated expression was found in mouse embryonic brain cortices, the intronic part of which was found to be strongly associated with gene ontology (GO) categories that are important for synaptogenesis and neural circuit formation. This suggested potentially novel genes, gene functions and regulatory mechanisms in early brain development. (Chapter 4)

Although the speed of generating genomic sequences was increasing rapidly, the development of genome annotation was lagging behind. This slowed down or prevented broader use of the newly sequenced genomes. To partially mitigate this situation, a new tool called Phoenix was developed for retrieving homologues of a given gene or gene family from unannotated genomes. Phoenix was fast and accurate in simulations using data from known gene families. Its advantage was further demonstrated by correctly retrieving homologues of a gene family that has a known complex evolutionary history. This tool allows gene family studies in unannotated or even partially assembled genomes. (Chapter 5)

Finally, this dissertation concluded with a discussion of the intrinsic limitations and advantages of DNA sequence analysis, along with its current and future potential applications. (Chapter 6)
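As an illustration of the corrected genetic distance mentioned for Chapter 3, the following is a minimal sketch of how rate heterogeneity among sites and invariant sites can enter a distance correction. The abstract does not state which substitution model the dissertation used, so the Jukes-Cantor (JC69) model with a Gamma-plus-invariant-sites correction is an assumption made here for illustration only. The function names `p_distance`, `corrected_distance` and `expected_p` are hypothetical; only the two default parameter values (α = 0.7078 and 64% invariant sites) come from the abstract.

```python
ALPHA = 0.7078       # Gamma shape parameter reported in the abstract
P_INVARIANT = 0.64   # fraction of invariant sites reported in the abstract


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Only columns where both sequences carry A, C, G or T are compared;
    gaps and ambiguity codes are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def corrected_distance(p, alpha=ALPHA, p_inv=P_INVARIANT):
    """JC69 + I + Gamma distance (assumed model, not confirmed by the source).

    Returns expected substitutions per site, averaged over all sites,
    given the observed proportion of differing sites p.
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    variable = 1.0 - p_inv                    # fraction of sites free to vary
    x = 1.0 - 4.0 * p / (3.0 * variable)
    if x <= 0:
        raise ValueError("observed distance is saturated under this model")
    return 0.75 * alpha * variable * (x ** (-1.0 / alpha) - 1.0)


def expected_p(d, alpha=ALPHA, p_inv=P_INVARIANT):
    """Inverse of corrected_distance: expected observed difference for distance d."""
    variable = 1.0 - p_inv
    return 0.75 * variable * (1.0 - (1.0 + 4.0 * d / (3.0 * alpha * variable)) ** (-alpha))
```

Under these assumptions the corrected distance is always at least as large as the raw proportion of differences, and the gap widens as sequences diverge, which is why ignoring the variation parameters could place distantly related genomes too close together.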


DNA sequence analysis: new applications with high throughput sequencing and new methods in studying gene families and human haplogroups
