Abstract:

Background: RNA-sequencing (RNA-Seq) has become a powerful technology for characterizing gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods developed for microarray data can be applied to RNA-Seq data, they are not ideal because RNA-Seq data are discrete counts. The Poisson and negative binomial distributions are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493–2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may be less appropriate than the negative binomial distribution when biological replicates are available and the data are overdispersed (i.e., when the variance is larger than the mean). However, negative binomial variables are more complicated to model because they involve a dispersion parameter that must be estimated.

Results: In this paper, we propose a negative binomial linear discriminant analysis for RNA-Seq data. Using Bayes' rule, we construct the classifier by fitting a negative binomial model, and we propose plug-in rules to estimate the unknown parameters in the classifier. We explore the relationship between the negative binomial classifier and the Poisson classifier, and we investigate numerically how dispersion affects the discriminant score. In simulations, the proposed method outperforms the methods it is compared with. We also analyze two real RNA-Seq data sets to demonstrate the advantages of our method in real-world applications.

Conclusions: We have developed a new classifier based on the negative binomial model for RNA-Seq data classification. Our simulation results show that the proposed classifier performs better than existing methods, and it can serve as an effective tool for classifying RNA-Seq data.
Based on the comparison results, we have provided some guidelines for scientists to decide which method should be used in the discriminant analysis of RNA-Seq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.R or https://github.com/yangchadam/NBLDA
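
The Bayes-rule construction described in the abstract can be illustrated with a short sketch. This is a minimal Python illustration, not the authors' R implementation. It assumes that the class-specific gene means, per-gene dispersions, class priors and a sample size factor have already been estimated (the paper's plug-in rules are not reproduced here), and every function and parameter name below is hypothetical. The sketch also assumes the mean-dispersion parameterization, in which a count with mean mu and dispersion phi has variance mu + phi * mu^2, so that the model approaches the Poisson as phi goes to zero.

```python
import math


def nb_logpmf(x, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi.

    Assumed parameterization: variance = mu + phi * mu**2, with phi > 0.
    """
    r = 1.0 / phi  # "size" parameter of the negative binomial
    return (
        math.lgamma(x + r) - math.lgamma(x + 1.0) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + x * math.log(mu / (r + mu))
    )


def poisson_logpmf(x, mu):
    """Log pmf of a Poisson with mean mu (the phi -> 0 limit of the model above)."""
    return x * math.log(mu) - mu - math.lgamma(x + 1.0)


def discriminant_scores(x, class_means, phi, priors, size_factor=1.0):
    """Return one score per class: log prior + summed per-gene NB log-likelihood.

    x            -- counts for one sample, one entry per gene
    class_means  -- class_means[k][g] is the assumed mean of gene g in class k
    phi          -- assumed per-gene dispersions
    priors       -- assumed class prior probabilities
    size_factor  -- assumed scaling for the sample's sequencing depth
    """
    scores = []
    for means_k, prior_k in zip(class_means, priors):
        loglik = sum(
            nb_logpmf(xg, size_factor * mg, pg)
            for xg, mg, pg in zip(x, means_k, phi)
        )
        scores.append(math.log(prior_k) + loglik)
    return scores


def classify(x, class_means, phi, priors, size_factor=1.0):
    """Assign the sample to the class with the largest discriminant score."""
    scores = discriminant_scores(x, class_means, phi, priors, size_factor)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy example with two genes and two classes (made-up numbers).
    means = [[5.0, 50.0], [50.0, 5.0]]
    label = classify([4, 60], means, phi=[0.1, 0.1], priors=[0.5, 0.5])
    print("predicted class index:", label)
```

In the toy example, the sample has a low count for the first gene and a high count for the second, so it is assigned to the first class (index 0). Independence across genes is assumed here so that the per-gene log-likelihoods can simply be summed.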

Sequence count data are poorly fit by the negative binomial distribution

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible

NBLDA: negative binomial linear discriminant analysis for RNA-Seq data

EBT: a Statistic Test Identifying Moderate Size of Significant Features with Balanced Power and Precision for Genome-Wide Rate Comparisons

Bayesian Analysis of RNA-Seq Data Using a Family of Negative Binomial Models.

Negative binomial count splitting for single-cell RNA sequencing data

Analysis of Frequency Count Data Using the Negative Binomial Distribution

A mechanistic model for the negative binomial distribution of single-cell mRNA counts

A heavy-tailed model for analyzing miRNA-seq raw read counts

Root Causal Inference from Single Cell RNA Sequencing with the Negative Binomial

Statistical inference for time course RNA-Seq data using a negative binomial mixed-effect model

A sparse negative binomial mixture model for clustering RNA-seq count data

Theoretical framework for the difference of two negative binomial distributions and its application in comparative analysis of sequencing data

Accurate inference in negative binomial regression

NBBt-test: a versatile method for differential analysis of multiple types of RNA-seq data

From Poisson Observations to Fitted Negative Binomial Distribution

Some Theoretical Comparisons of Negative Binomial and Zero-Inflated Poisson Distributions

Comparison and evaluation of statistical error models for scRNA-seq

Statistics or biology: the zero-inflation controversy about scRNA-seq data

Neglecting the impact of normalization in semi-synthetic RNA-seq data simulations generates artificial false positives

Non-parametric Bayesian modelling of digital gene expression data