Abstract:

BACKGROUND: In recent years, the high-throughput sequencing technology RNA-Seq has been widely used to quantify gene and isoform expression in transcriptome studies. Accurate expression measurement from the millions or billions of short reads generated is obstructed by two difficulties. The first is the ambiguous mapping of reads to the reference transcriptome caused by alternative splicing, which increases the uncertainty in estimating isoform expression. The second is the non-uniformity of the read distribution along the reference transcriptome due to positional, sequencing, mappability and other undiscovered sources of bias. This violates the assumption of a uniform read distribution made by many expression calculation approaches, such as the direct RPKM calculation and Poisson-based models. Many methods have been proposed to address these difficulties. Some approaches employ latent variable models to discover the underlying pattern of read sequencing. However, most of these methods correct bias based on the surrounding sequence content and share a single bias model across all genes. They therefore cannot estimate the gene- and isoform-specific biases revealed by recent studies.

RESULTS: We propose a latent variable model, NLDMseq, to estimate gene and isoform expression. Our method adopts latent variables to model the unknown isoforms from which reads originate and the underlying proportions of multiple spliced variants. The isoform- and exon-specific read sequencing biases are modeled to account for the non-uniformity of the read distribution, and are identified by utilizing the replicate information of multiple lanes of a single library run. We employ simulated and real data to verify the performance of our method in terms of accuracy in the calculation of gene and isoform expression. Results show that NLDMseq obtains gene and isoform expression estimates that are competitive with those of popular alternatives. Finally, the proposed method is applied to the detection of differential expression (DE) to show its usefulness in downstream analysis.

CONCLUSIONS: The proposed NLDMseq method provides an approach to accurately estimate gene and isoform expression from RNA-Seq data by modeling the isoform- and exon-specific read sequencing biases. It makes use of a latent variable model to discover the hidden pattern of read sequencing. We have shown that it works well on both simulated and real datasets, and has competitive performance compared to popular methods. The method has been implemented as freely available software, which can be found at https://github.com/PUGEA/NLDMseq.
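
For readers unfamiliar with the direct RPKM calculation mentioned in the background, the following is a minimal sketch of the standard formula (reads per kilobase of transcript per million mapped reads). It is not part of NLDMseq; the function name `rpkm` and the example numbers are hypothetical and chosen only for illustration. The formula treats reads as uniformly distributed along the transcript, which is exactly the assumption the abstract says is violated in practice.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Illustrative helper (hypothetical name), assuming reads fall
    uniformly along the transcript and each read maps unambiguously.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return 1e9 * read_count / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
example = rpkm(500, 2000, 10_000_000)
print(example)
```

Because the calculation has no term for position-dependent or isoform-specific bias, and no way to split a read that maps to several isoforms, it cannot by itself address either difficulty described above.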

Comprehensive evaluation of RNA-seq quantification methods for linearity

RNAflow: An Effective and Simple RNA-Seq Differential Gene Expression Pipeline Using Nextflow

Comparing the Normalization Methods for the Differential Analysis of Illumina High-Throughput RNA-Seq Data

Fourteen years of cellular deconvolution: methodology, applications, technical evaluation and outstanding challenges

Unraveling the complexity: understanding the deconvolutions of RNA-seq data

Deconvolution of Base Pair Level RNA-Seq Read Counts for Quantification of Transcript Expression Levels

Comprehensive evaluation of differential expression analysis methods for RNA-seq data

Limitations of Alignment-Free Tools in Total RNA-seq Quantification.

Evaluation and comparison of computational tools for RNA-seq isoform quantification

A Novel Computational Complete Deconvolution Method Using RNA-seq Data

A comparison of methods for differential expression analysis of RNA-seq data

Modeling and analysis of RNA-seq data: a review from a statistical perspective

An updated State-of-the-Art Overview of transcriptomic Deconvolution Methods

Union Exon Based Approach for Rna-Seq Gene Quantification: to Be or Not to Be?

Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

Statistical Modeling of RNA-Seq Data

PennSeq: Accurate Isoform-Specific Gene Expression Quantification in RNA-Seq by Modeling Non-Uniform Read Distribution

Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction

Improving RNA-Seq Expression Estimation by Modeling Isoform- and Exon-Specific Read Sequencing Rate

Influence of RNA extraction methods and library selection schemes on RNA-seq data

CDSeq: A Novel Complete Deconvolution Method for Dissecting Heterogeneous Samples Using Gene Expression Data.