Abstract: Integrated analysis of human gene expression data from multiple studies has become essential in genomics research on complex traits. However, integrating data generated from different cohorts on different platforms, such as microarray and RNA-seq, often requires data preprocessing, including normalization. In this study, we empirically evaluate 9 commonly used cross-platform normalization methods. We classify these methods into two main types: joint and separate normalization. Joint methods normalize multiple datasets together, while separate methods normalize each dataset independently. We further divide these methods into unsupervised and supervised approaches, depending on whether they use outcomes during normalization. Quantile Normalization (QN) is an example of a joint unsupervised method, Rank-in is an example of a joint supervised method, and Training Distribution Matching (TDM) is an example of a separate unsupervised method. We assess each method's ability to cluster samples, predict outcomes, and detect differentially expressed (DE) genes using three real datasets and simulated data. First, our real data analysis suggests that while joint supervised methods cluster sample groups better than the other two method groups, they use the outcome data twice, which artificially inflates their clustering performance. Their bias is further demonstrated by their inflated type I error in DE analysis and by their clustering results on simulated data with no DE genes. Second, for outcome prediction, supervised normalization is no longer applicable. Among the unsupervised methods, QN significantly outperforms the other approaches, regardless of whether RNA-seq data are used to predict microarray outcomes or vice versa. Finally, we compare normalization methods on downstream DE analysis using simulation. In addition to direct DE analysis on the combined normalized data using the non-parametric Wilcoxon rank-sum test, we also perform a meta-analysis that combines the p-values from the DE analysis of each individual dataset. For DE analysis, the meta-analysis consistently achieves the best balance between controlling type I error and maximizing power for DE gene detection. Our research suggests that while normalization is critical for the integrated analysis of transcriptomics data, simple QN is the most efficient and unbiased normalization approach for outcome prediction, and meta-analysis is the most appropriate approach for DE analysis.
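
To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows joint quantile normalization across samples pooled from both platforms, and a meta-analysis that combines per-dataset p-values. The abstract does not say which p-value combination rule is used, so Fisher's method is an assumption here. The function names, the toy values, and the handling of ties (broken by gene order rather than averaged) are also illustrative choices.

```python
import math


def quantile_normalize(samples):
    """Joint quantile normalization (illustrative sketch).

    `samples` is a list of samples, each a list of expression values
    for the same genes in the same order. Every sample is mapped onto
    a shared reference distribution: the mean of the k-th smallest
    value across all samples. Ties are broken by gene order here,
    which is a simplification.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples)
        for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumption;
    the abstract does not name the combination rule).

    The statistic -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form used below.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


# Toy data: one microarray-like sample and one RNA-seq-like sample,
# three genes each (hypothetical values).
microarray_sample = [2.0, 5.0, 3.0]
rnaseq_sample = [10.0, 40.0, 20.0]
normalized = quantile_normalize([microarray_sample, rnaseq_sample])

# Hypothetical per-dataset p-values for one gene, e.g. from a
# Wilcoxon rank-sum test run separately on each dataset.
combined_p = fisher_combine([0.05, 0.05])
```

After normalization, both toy samples share the same set of values and differ only in which gene holds which value. In the meta-analysis route the datasets are never pooled, so the combined p-value depends only on the within-dataset tests.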

A Unified Model for Differential Expression Analysis of RNA-seq Data Via L1-Penalized Linear Regression

A Unified Model for Joint Normalization and Differential Gene Expression Detection in RNA-Seq Data

Joint Between-Sample Normalization and Differential Expression Detection Through ℓ0-Regularized Regression

Unit-Free and Robust Detection of Differential Expression from RNA-Seq Data

Degps is a Powerful Tool for Detecting Differential Expression in RNA-sequencing Studies

A two-step strategy for detecting differential gene expression in cDNA microarray data

Differential expression analysis for paired RNA-seq data

PDEGEM: Modeling non-uniform read distribution in RNA-Seq data

Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation

Detecting Differentially Expressed Genes by Smoothing Effect of Gene Length on Variance Estimation

A Two-Part Mixed Model for Differential Expression Analysis in Single-Cell High-Throughput Gene Expression Data

A deep generative model for single-cell RNA sequencing with application to detecting differentially expressed genes

Evaluating Cross-Platform Normalization Methods for Integrated Microarray and RNA-seq Data Analysis

Identifying stably expressed genes from multiple RNA-Seq data sets

Dynamic Model for RNA-seq Data Analysis

PLNseq: a multivariate Poisson lognormal distribution for high-throughput matched RNA-sequencing read count data

A balanced method detecting differentially expressed genes for RNA-sequencing data

A penalized likelihood approach for robust estimation of isoform expression

Identifying differentially expressed genes in human acute leukemia and mouse brain microarray datasets utilizing QTModel

Modeling expression ranks for noise-tolerant differential expression analysis of scRNA-seq data

A statistical normalization method and differential expression analysis for RNA-seq data between different species