QUBIC2: A novel biclustering algorithm for large-scale bulk RNA-sequencing and single-cell RNA-sequencing data analysis

Juan Xie,Anjun Ma,Yu Zhang,Bingqiang Liu,Changlin Wan,Sha Cao,Chi Zhang,Qin Ma
DOI: https://doi.org/10.1101/409961
2018-09-07
Abstract:ABSTRACT The combination of biclustering and large-scale gene expression data holds a promising potential for inference of the condition specific functional pathways/networks. However, existing biclustering tools do not have satisfied performance on high-resolution RNA-sequencing (RNA-Seq) data, majorly due to the lack of ( i ) a consideration of high sparsity of RNA-Seq data, e.g., the massive zeros or lowly expressed genes in the data, especially for single-cell RNA-Seq (scRNA-Seq) data, and ( ii ) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. Here we presented a novel biclustering algorithm namely QUBIC2, for the analysis of large-scale bulk RNA-Seq and scRNA-Seq data. Key novelties of the algorithm include ( i ) used a truncated model to handle the unreliable quantification of genes with low or moderate expression, ( ii ) adopted the mixture Gaussian distribution and an information-divergency objective function to capture shared transcriptional regulation signals among a set of genes, ( iii ) utilized a Core-Dual strategy to identify biclusters and optimize relevant parameters, and ( iv ) developed a size-based P -value framework to evaluate the statistical significances of all the identified biclusters. Our method validation on comprehensive data sets of bulk and single cell RNA-seq data suggests that QUBIC2 had superior performance in functional modules detection and cell type classification compared with the other five widely-used biclustering tools. In addition, the applications of temporal and spatial data demonstrated that QUBIC2 can derive meaningful biological information from scRNA-Seq data. The source code for QUBIC2 can be freely accessed at https://github.com/maqin2001/qubic2 .
What problem does this paper attempt to address?