Abstract:

Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide additional evidence for rare transcripts and genes such as mono-exonic and non-coding genes that were previously either undetectable or impossible to differentiate from sequencing noise.

Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6,000 erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA’s genome assembly based error correction and gene feature evidence, we predicted 2566 putative novel non-coding genes and 1557 putative novel protein coding gene models.

Conclusions: Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The use of parameter tuning and extensive output information of the TAMA software package allows for in-depth exploration of eukaryotic transcriptomes.
We have found long read data based evidence for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required for differentiating sequencing noise from real genes in long read RNA sequencing data.

Illuminating the dark side of the human transcriptome with long read transcript sequencing

Transcriptome variation in human tissues revealed by long-read sequencing

A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

High-Resolution Transcriptome Analysis with Long-Read RNA Sequencing

Knowledge-Based Reconstruction of mRNA Transcripts with Short Sequencing Reads for Transcriptome Research

Enhancing transcriptome expression quantification through accurate assignment of long RNA sequencing reads with TranSigner

UNAGI: Yeast Transcriptome Reconstruction and Gene Discovery Using Nanopore Sequencing

Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing

Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing

Enhancing novel isoform discovery: leveraging nanopore long-read sequencing and machine learning approaches

UTAP: User-friendly Transcriptome Analysis Pipeline

Transcriptomics in the RNA-seq era

Real-time transcriptomic profiling in distinct experimental conditions

5' Long Serial Analysis of Gene Expression (LongSAGE) and 3' LongSAGE for Transcriptome Characterization and Genome Annotation

Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing

Deep annotation of long noncoding RNAs by assembling RNA-seq and small RNA-seq data

SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification

Contrasting and combining transcriptome complexity captured by short and long RNA sequencing reads

Integrating short-read and long-read single-cell RNA sequencing for comprehensive transcriptome profiling in mouse retina

Characterizing and Annotating the Genome Using RNA-seq Data