Abstract:
Background: The de novo assembly of transcriptomes from short shotgun sequences raises challenges due to random and non-random sequencing biases and inherent transcript complexity. We sought to define a pipeline for de novo transcriptome assembly to aid researchers working with emerging model systems where well-annotated genome assemblies are not available as a reference. To detail this experimental and computational method, we used early embryos of the sea anemone Nematostella vectensis, an emerging model system for studies of animal body plan evolution. We performed RNA-seq on embryos up to 24 h of development using Illumina HiSeq technology and evaluated independent de novo assembly methods. The resulting reads were assembled using either the Trinity assembler on all quality-controlled reads or both the Velvet and Oases assemblers on reads passing a stringent digital normalization filter. A control set of mRNA standards from the National Institute of Standards and Technology (NIST) was included in our experimental pipeline to invest our transcriptome with quantitative information on absolute transcript levels and to provide additional quality control.
Results: We generated >200 million paired-end reads from directional cDNA libraries, representing well over 20 Gb of sequence. With the Trinity assembler pipeline, including preliminary quality control steps, more than 86% of reads aligned to the resulting reference transcriptome. Digital normalization combined with assembly by Velvet and Oases required far less computing power and less processing time, while still mapping 82% of reads. We have made the raw sequencing reads and assembled transcriptome publicly available.
Conclusions: Nematostella vectensis was chosen for its strategic position in the tree of life for studies into the origins of the animal body plan; however, the challenge of reference-free transcriptome assembly is relevant to all systems for which well-annotated gene models and independently verified genome assemblies may not be available. To navigate this new territory, we have constructed a pipeline for library preparation and computational analysis for de novo transcriptome assembly. The gene models in this reference transcriptome define the set of genes transcribed in early Nematostella development and will provide a valuable dataset for further gene regulatory network investigations.
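
The digital normalization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly described median k-mer abundance rule: a read is kept only if the median count of its k-mers, among the reads kept so far, is below a cutoff. The function names, the exact in-memory counter, and the values of `k` and `cutoff` are illustrative assumptions, not the parameters or implementation used in the study. Production tools typically count canonical k-mers (both strands) with a probabilistic data structure rather than an exact dictionary.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every substring of length k in seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below cutoff.

    `k` and `cutoff` are hypothetical defaults chosen for illustration.
    """
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy input: one highly redundant read and one unique read.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalize(reads, k=4, cutoff=3)
```

In this toy input, redundant copies of the abundant read are discarded once their k-mers reach the cutoff, while the unique read is retained. This is why normalization can shrink the input to an assembler without removing low-coverage transcripts.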

Improving transcriptome construction in non-model organisms: integrating manual and automated gene definition in Emiliania huxleyi

Combining independent de novo assemblies optimizes the coding transcriptome for nonconventional model eukaryotic organisms

Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing

Illuminating the dark side of the human transcriptome with long read transcript sequencing

De novo assembly of transcriptomes and differential gene expression analysis using short-read data from emerging model organisms – a brief guide

TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

UNAGI: Yeast Transcriptome Reconstruction and Gene Discovery Using Nanopore Sequencing

Improving transcriptome assembly through error correction of high-throughput sequence reads

OMAnnotator: a novel approach to building an annotated consensus genome sequence

Augmenting transcriptome assembly combinatorially

A quantitative reference transcriptome for Nematostella vectensis early embryonic development: a pipeline for de novo assembly in emerging model systems

Building better genome annotations across the tree of life

UnigeneFinder: An automated pipeline for gene calling from transcriptome assemblies without a reference genome

A Manual Curation Strategy to Improve Genome Annotation: Application to a Set of Haloarchael Genomes

Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study

Exploring the transcriptome of non-model oleaginous microalga Dunaliella tertiolecta through high-throughput sequencing and high performance computing

A beginner’s guide to manual curation of transposable elements

Complete mitochondrial genomes from transcriptomes: assessing pros and cons of data mining for assembling new mitogenomes

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

Roast: a tool for reference-free optimization of supertranscriptome assemblies

Knowledge-Based Reconstruction of mRNA Transcripts with Short Sequencing Reads for Transcriptome Research