Abstract:
Background: RNA-sequencing (RNA-Seq) has become a powerful technology to characterize gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods developed for microarray data can be applied to RNA-Seq data, they are not ideal because RNA-Seq data are discrete. The Poisson and negative binomial distributions are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493–2518, 2011) proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson assumption may be less appropriate than the negative binomial distribution when biological replicates are available and the data are overdispersed (i.e., when the variance is larger than the mean). Negative binomial variables are, however, more complicated to model because they involve a dispersion parameter that needs to be estimated.
Results: In this paper, we propose a negative binomial linear discriminant analysis for RNA-Seq data. Using Bayes' rule, we construct the classifier by fitting a negative binomial model, and we propose plug-in rules to estimate the unknown parameters in the classifier. We explore the relationship between the negative binomial classifier and the Poisson classifier, with a numerical investigation of the impact of dispersion on the discriminant score. Simulation results show that the proposed method outperforms existing methods. We also analyze two real RNA-Seq data sets to demonstrate the advantages of our method in real-world applications.
Conclusions: We have developed a new classifier for RNA-Seq data based on the negative binomial model. Our simulation results show that the proposed classifier performs better than existing methods, and it can serve as an effective tool for classifying RNA-Seq data. Based on the comparison results, we provide guidelines to help scientists decide which method to use in the discriminant analysis of RNA-Seq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.R or https://github.com/yangchadam/NBLDA
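
To illustrate the idea of a Bayes-rule classifier built on a negative binomial model, and how it relates to a Poisson classifier, the following Python sketch may help. It is not the authors' implementation (their code is in R at the links above). It assumes the class-specific gene means `mu`, the gene-wise dispersions `phi`, and the class priors are already known, whereas the paper estimates them with plug-in rules that are not reproduced here. The function names and the toy numbers are hypothetical. The score is the log of the negative binomial likelihood times the prior, with terms that do not depend on the class dropped; as the dispersion tends to zero it approaches the Poisson score.

```python
import numpy as np


def nb_discriminant_scores(x, mu, phi, priors):
    """Negative binomial discriminant score for each class (illustrative sketch).

    x      : (G,) observed counts for one sample
    mu     : (K, G) assumed class-specific means (size factors already absorbed)
    phi    : (G,) assumed gene-wise dispersions, all > 0
    priors : (K,) class prior probabilities
    Terms that are identical across classes are omitted.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    phi = np.asarray(phi, dtype=float)
    priors = np.asarray(priors, dtype=float)

    log_one_plus = np.log1p(mu * phi)  # log(1 + mu * phi), shape (K, G)
    per_gene = x * (np.log(mu) - log_one_plus) - log_one_plus / phi
    return per_gene.sum(axis=1) + np.log(priors)


def poisson_discriminant_scores(x, mu, priors):
    """Poisson discriminant score, the limit of the score above as phi -> 0."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    priors = np.asarray(priors, dtype=float)
    return (x * np.log(mu) - mu).sum(axis=1) + np.log(priors)


# Toy example with two classes and two genes (made-up numbers).
mu = np.array([[5.0, 50.0],
               [50.0, 5.0]])
phi = np.array([0.1, 0.1])
priors = np.array([0.5, 0.5])
x = np.array([4, 60])

nb_scores = nb_discriminant_scores(x, mu, phi, priors)
predicted_class = int(np.argmax(nb_scores))  # class with the largest score

# With a very small dispersion the two scores nearly coincide.
nb_small_phi = nb_discriminant_scores(x, mu, np.full(2, 1e-8), priors)
pois_scores = poisson_discriminant_scores(x, mu, priors)
```

A sample is assigned to the class with the largest score. Comparing `nb_scores` with `pois_scores` for increasing values of `phi` gives a rough sense of how dispersion changes the discriminant score.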

Nonparametric clustering of RNA-sequencing data

A sparse negative binomial mixture model for clustering RNA-seq count data

Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

A Multivariate Poisson-Log Normal Mixture Model for Clustering Transcriptome Sequencing Data

Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data

A clustering procedure for three-way RNA sequencing data using data transformations and matrix-variate Gaussian mixture models

Omada: robust clustering of transcriptomes through multiple testing

Clustering de Novo by Gene of Long Reads from Transcriptomics Data

Single-Cell Transcriptome Data Clustering via Multinomial Modeling and Adaptive Fuzzy K-Means Algorithm

SCNMLRR: Single Cell Clustering Based on Low-rank Non-negative Matrix Factorization

De novo clustering of long reads by gene from transcriptomics data

Clustering pipeline for determining consensus sequences in targeted next-generation sequencing

Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data

RNA-clique: a method for computing genetic distances from RNA-seq data

Robust model-based clustering with gene ranking

NBLDA: negative binomial linear discriminant analysis for RNA-Seq data

Unsupervised Cluster Analysis and Gene Marker Extraction of scRNA-seq Data Based On Non-Negative Matrix Factorization

Detecting Heterogeneity in Single-Cell RNA-Seq Data by Non-Negative Matrix Factorization

Machine learning and statistical methods for clustering single-cell RNA-sequencing data

Bayesian Nonparametric Clustering with Feature Selection for Spatially Resolved Transcriptomics Data

Negative Binomial Additive Model for RNA-Seq Data Analysis