Abstract: Understanding the sequence information encoded in DNA, RNA and proteins is important for both basic and applied research in the life sciences. Extensive effort has been devoted to the research and development of DNA sequence analysis methods. The studies described in this dissertation explored new applications of existing methods in the context of the recent development of ultra-high-throughput sequencing technologies. This dissertation also included new methods developed for studying gene families and human haplogroups. The theories, algorithms and tools for analyzing DNA sequence information relevant to these studies are reviewed in Chapter 1.

The recent development of DNA sequencing technologies brought many new research opportunities. It also brought great challenges, mainly because of the large volume of data produced by the latest high-throughput sequencing technologies. The potential of these new technologies was exploited to complete a 100,000-year-old ancient polar bear mitochondrial genome. With this and some additional modern bear data, the divergence time of the polar bear matrilineage was estimated to be around 130,000 years ago, which is significantly older than some recent estimates. This estimate indicated that the matrilineal ancestors of modern polar bears adapted to the polar niche within 30,000 years after the speciation event and spread along the entire Arctic Circle over the next 100,000 years. This recent speciation and rapid expansion is analogous to the evolution and migration of modern humans, whose lineage characteristics were also briefly studied using the same technologies. (Chapter 2)

Because of the increased efficiency of the latest sequencing technologies, complete human mitochondrial genomes have been generated at an ever-increasing pace. Although mitochondrial haplogroups, and their classification and identification, were widely used in human evolution and population studies, the current tools could not take full advantage of the rapidly growing number of new mitochondrial genomes. An updated mitochondrial haplogroup classification system was thus developed with evolutionary models that incorporate the mitochondrial genomic variations within the human population. Previous methods did not consider these variations, an omission that could lead to incorrectly classified haplogroups. The variation parameters, including the whole-genome substitution rate (0.013 - 0.1 substitutions per generation), the rate heterogeneity among sites (Gamma distribution shape parameter α = 0.7078) and the percentage of invariant sites (64%), were estimated from 7985 full-length human mitochondrial genome sequences. Haplogroups were then classified based on the corrected genetic distance estimates and modeled with position-specific matrices. A new haplogroup identification system was developed based on the resulting matrices and the maximum-likelihood estimation (MLE) method, permitting fast and accurate haplogroup assignment for both known and new mitochondrial genomes. The entire system is available through the HapSearch web application (http://hapsearch.synblex.com). (Chapter 3)

The latest sequencing technologies also allowed a more thorough study of stage-specific transcriptional activities. To elucidate the transcriptomic profiles and new transcriptomic activities in neural development, nine recent RNA-seq datasets were analyzed, corresponding to tissues and organs ranging from stem cells and embryonic brain cortex to adult whole brain. Global similarities between the neural and stem cell transcriptomes were found at both the genic and chromosomal levels. A previously undocumented high level of unannotated expression was found in mouse embryonic brain cortices, the intronic part of which was strongly associated with gene ontology (GO) categories that are important for synaptogenesis and neural circuit formation. This suggested potentially novel genes, gene functions and regulatory mechanisms in early brain development. (Chapter 4)

Although the speed of generating genomic sequences was increasing rapidly, the development of genome annotation was lagging behind. This slowed down or prevented broader use of the newly sequenced genomes. To partially mitigate this situation, a new tool, called Phoenix, was developed for retrieving homologues of a given gene or gene family from unannotated genomes. Phoenix was fast and accurate in simulations using data from known gene families. Its advantage was further demonstrated by correctly retrieving homologues of a gene family that has a known complex evolutionary history. This tool allows gene family studies in unannotated or even partially assembled genomes. (Chapter 5)

Finally, this dissertation concluded with a discussion of the intrinsic limitations and advantages of DNA sequence analysis, along with its current and future application potential. (Chapter 6)
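
As a rough illustration of the haplogroup assignment idea summarized for Chapter 3, the sketch below builds one position-specific probability matrix per group from aligned sequences and assigns a query to the group with the highest log-likelihood. It is a minimal sketch under stated assumptions, not the HapSearch implementation: the function names, the pseudocount, the independence of sites, and the six-base toy sequences are all hypothetical, and the substitution model with Gamma rate heterogeneity and invariant sites described in the abstract is not reproduced here.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_psm(aligned_seqs, pseudocount=1.0):
    """Build a position-specific probability matrix from equal-length aligned sequences.

    The pseudocount (an assumption of this sketch) keeps unseen bases from
    receiving zero probability.
    """
    length = len(aligned_seqs[0])
    matrix = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def log_likelihood(query, matrix):
    """Sum per-position log probabilities, treating sites as independent.

    Characters outside ACGT (for example gaps or N) are skipped.
    """
    return sum(math.log(col[base]) for base, col in zip(query, matrix) if base in col)


def assign_haplogroup(query, matrices):
    """Return the label whose matrix gives the query the highest log-likelihood."""
    return max(matrices, key=lambda label: log_likelihood(query, matrices[label]))


# Toy aligned sequences; the labels and bases are illustrative only.
toy_groups = {
    "H": ["ACGTAC", "ACGTAC", "ACGTAT"],
    "L": ["TCGAAC", "TCGAAC", "TCGCAC"],
}
matrices = {label: build_psm(seqs) for label, seqs in toy_groups.items()}
best = assign_haplogroup("ACGTAC", matrices)
```

In this toy setup the query matches the majority base at most positions of one group, so that group receives the higher likelihood. A real system would work on full-length aligned mitochondrial genomes and would need to handle insertions, deletions and missing data.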

DNA sequence analysis: new applications with high throughput sequencing and new methods in studying gene families and human haplogroups