Abstract: Integrated analysis of human gene expression data from multiple studies has become essential in genomics research on complex traits. However, integrating data generated from different cohorts on different platforms, such as microarray and RNA-seq, often requires preprocessing, including normalization. In this study, we empirically evaluate nine commonly used cross-platform normalization methods. We classify these methods into two main types: joint and separate normalization. Joint methods normalize multiple datasets together, while separate methods normalize each dataset independently. We further divide them into unsupervised and supervised approaches, depending on whether they use outcomes during normalization. Quantile Normalization (QN) is an example of a joint unsupervised method, Rank-in is an example of a joint supervised method, and Training Distribution Matching (TDM) is an example of a separate unsupervised method. We assess each method's ability to cluster samples, predict outcomes, and detect differentially expressed (DE) genes using three real datasets and simulated data. First, our real data analysis suggests that although joint supervised methods cluster sample groups better than the other two method groups, they use the outcome data twice, which artificially inflates their clustering performance. This bias is further demonstrated by their inflated type I error in DE analysis and by their clustering results on simulated data with no DE genes. Second, for outcome prediction, supervised normalization is no longer applicable. Among the unsupervised methods, QN significantly outperforms the other approaches, regardless of whether RNA-seq data are used to predict microarray outcomes or vice versa. Finally, we compare normalization methods on downstream DE analysis using simulation. In addition to direct DE analysis of the combined normalized data using the non-parametric Wilcoxon rank-sum test, we perform a meta-analysis that combines the p-values from the DE analysis of each individual dataset. The meta-analysis consistently achieves the best balance between controlling type I error and maximizing power for DE gene detection. Our research suggests that while normalization is critical for the integrated analysis of transcriptomics data, simple QN is the most efficient and unbiased normalization approach for outcome prediction, and meta-analysis is the most appropriate approach for DE analysis.
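As an illustration of the two approaches the abstract recommends, the following is a minimal Python sketch of joint quantile normalization and of a p-value-combining meta-analysis. It is not the implementation used in the study. The function names `quantile_normalize` and `fisher_combine` are hypothetical, tied values are ranked by position rather than averaged, and Fisher's method is an assumed choice, since the abstract does not say which p-value combination rule was used.

```python
import math


def quantile_normalize(samples):
    """Map every sample (a list of per-gene values) onto one shared distribution.

    Simplification: ties are broken by gene position; production
    implementations usually average the reference values of tied ranks.
    """
    n_genes = len(samples[0])
    if any(len(s) != n_genes for s in samples):
        raise ValueError("all samples must cover the same genes")
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col) / len(samples) for col in zip(*sorted_samples)]
    normalized = []
    for s in samples:
        # Gene indices ordered from the smallest to the largest value.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


def fisher_combine(pvalues):
    """Combine independent per-study p-values with Fisher's method (assumed).

    The statistic X = -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom under the null, whose upper tail has a closed form.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


# Two toy "platforms" measuring the same three genes on different scales.
microarray_like = [5.0, 2.0, 3.0]
rnaseq_like = [4.0, 1.0, 6.0]
print(quantile_normalize([microarray_like, rnaseq_like]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]

# One gene tested separately in two studies, then combined.
print(fisher_combine([0.05, 0.05]))
```

In this sketch the normalization step never sees sample outcomes, which is what makes it unsupervised, and the meta-analysis step needs only per-study p-values, so the raw expression values from different platforms are never pooled.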

Evaluating Cross-Platform Normalization Methods for Integrated Microarray and RNA-seq Data Analysis
