Abstract: Deep sequencing with second-generation sequencing technology has enabled the study of complex biological samples that contain many DNA units at once, as in transcriptomics and metagenomics. Unlike general genome sequence assembly, a DNA unit in such a sample may be present in multiple copies that differ by small or substantial structural variations, SNPs, or both. Deep sequencing is therefore necessary to resolve these variations concurrently. This dissertation focuses on de novo transcriptome assembly, which requires the simultaneous assembly of multiple alternatively spliced gene transcripts. In practice, de novo transcriptome assembly is the only option for studying the transcriptome of organisms that lack a reference genome sequence, and it can also be applied to identify novel transcripts and structural variations in the gene regions of model organisms. We propose WEAV for de novo transcriptome assembly; it consists of two separate processes: clustering and assembly. In the clustering process, WEAV reduces the complexity of an RNA-seq dataset by partitioning it: a diverse dataset, in which reads from many genes are mixed together, is simplified into many smaller clustered read sets, each containing reads from only a few genes. The underlying idea is straightforward. A sequencer samples reads from random positions, so reads from one gene overlap with other reads from the same gene if the sequencing depth is sufficient. These overlaps are the key to connecting reads from one gene. We can transform a dataset into a graph in which each read is a node and two reads are connected by an edge when they overlap. Each connected component then becomes a clustered read set. As a result, we can assume that each cluster contains one or a few genes rather than a mixture of many.
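The clustering idea described above (reads as nodes, overlaps as edges, connected components as clusters) can be illustrated with a minimal sketch. This is not WEAV's implementation: the function names `suffix_prefix_overlap` and `cluster_reads`, the exact-match overlap test, the `min_len` threshold, and the quadratic all-pairs comparison are all assumptions made for illustration. Real reads would also need handling of sequencing errors and reverse complements, which the sketch ignores.

```python
from typing import Dict, List


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> bool:
    """Return True if a suffix of `a` of at least `min_len` bases equals a prefix of `b`.

    Hypothetical helper: exact matching only, no error tolerance.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return True
    return False


def cluster_reads(reads: List[str], min_len: int = 4) -> List[List[str]]:
    """Group reads into connected components of the overlap graph (union-find)."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Naive all-pairs comparison; assumed here only to keep the sketch short.
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if (suffix_prefix_overlap(reads[i], reads[j], min_len)
                    or suffix_prefix_overlap(reads[j], reads[i], min_len)):
                union(i, j)

    clusters: Dict[int, List[str]] = {}
    for i, read in enumerate(reads):
        clusters.setdefault(find(i), []).append(read)
    return list(clusters.values())


if __name__ == "__main__":
    toy_reads = ["ACGTACGG", "ACGGTTCA", "TTCAGGAT", "CCCCTTTT", "TTTTGGGG"]
    for cluster in cluster_reads(toy_reads, min_len=4):
        print(cluster)
```

In the toy input, the first three reads chain together through 4-base overlaps and the last two form a second chain, so the sketch yields two clusters, mirroring the assumption that each connected component holds reads from one or a few genes.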
After this process, WEAV assembles each clustered read set using a de Bruijn graph as a backbone, and a novel error correction process simplifies the backbone with a fast mapping tool, PerM. Roughly speaking, WEAV treats the identification of multiple alternatively spliced gene transcripts on this graph as an instance of the classical Shortest Common Superstring problem and approaches it as a Set Cover problem. We propose novel statistical measures to make this NP-hard problem manageable: explainability, based on the likelihood of sequences, and correctness, based on bootstrapping. We compared WEAV with other assemblers on a variety of simulated read sets, evaluating performance with widely used measures such as specificity, sensitivity, N50, and the length of the longest sequence. We then tested WEAV on an experimental dataset of 58.58 million 100 bp human brain transcriptome reads. WEAV assembled 156,494 contigs longer than 300 bp. Of these contigs, 96.3% (specificity) mapped to RefSeq, Gencode, or the human genome sequence (hg19), and they covered 72% of the sequenced bases annotated in RefSeq and Gencode. This high sensitivity and specificity show the exceptional power of WEAV for transcriptome assembly.
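Of the evaluation measures named above, N50 is the least self-explanatory, so a short sketch may help. It uses the common definition: the length L such that contigs of length at least L together contain at least half of all assembled bases. The function name `n50` and the input values are illustrative assumptions, not data from the WEAV evaluation.

```python
from typing import List


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


if __name__ == "__main__":
    contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # illustrative values only
    print("N50:", n50(contig_lengths))
    print("Longest contig:", max(contig_lengths))
```

For the illustrative lengths, the total is 54 bases, and the three longest contigs (10, 9, 8) sum to 27, exactly half, so the N50 is 8.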
