Abstract: Integrated analysis of human gene expression data from multiple studies has become essential in genomics research for complex traits. However, integrating data generated from different cohorts with different platforms, such as microarray and RNA-seq, often requires data preprocessing, including normalization. In this study, we empirically evaluate nine commonly used cross-platform normalization methods. We classify these methods into two main types: joint and separate normalization. Joint methods normalize multiple datasets together, while separate methods normalize each dataset independently. We further divide these methods into unsupervised and supervised approaches, depending on whether they use outcomes during normalization. Quantile Normalization (QN) is an example of a joint unsupervised method, Rank-in is an example of a joint supervised method, and Training Distribution Matching (TDM) is an example of a separate unsupervised method. We assess each method's ability to cluster samples, predict outcomes, and detect differentially expressed (DE) genes using three real datasets and simulated data. First, our real data analysis suggests that while joint supervised methods cluster sample groups better than the other two method groups, they use the outcome data twice, which artificially inflates their clustering performance. Their biases are further demonstrated by inflated type I error in DE analysis and by clustering results from simulated data with no DE genes. Second, for outcome prediction, supervised normalization is no longer applicable. Among the unsupervised methods, QN significantly outperforms the other approaches, regardless of whether RNA-seq data are used to predict microarray outcomes or vice versa. Finally, we compare normalization methods on downstream DE analysis using simulation. 
In addition to direct DE analysis on the combined normalized data using the non-parametric Wilcoxon rank-sum test, we also perform a meta-analysis that combines the p-values from the DE analysis of each individual dataset. For DE analysis, the meta-analysis consistently achieves the best balance between controlling type I error and maximizing power for DE gene detection. Our research suggests that while normalization is critical for the integrated analysis of transcriptomics data, simple QN is the most efficient and unbiased normalization approach for outcome prediction, and meta-analysis is the most appropriate approach for DE analysis.
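
The two approaches the abstract recommends can be illustrated with a minimal sketch. It assumes a genes-by-samples matrix in which the columns come from different platforms, and it uses Fisher's method to combine per-dataset p-values. The abstract does not say which combination rule the study used, so Fisher's method is an assumption made for illustration. The function names `quantile_normalize` and `fisher_combine` are hypothetical and are not taken from the paper.

```python
import math
import numpy as np


def quantile_normalize(expr):
    """Joint quantile normalization of a genes x samples matrix.

    Every column (sample) is mapped onto the same reference distribution,
    taken here as the rank-wise mean of the sorted columns.
    """
    expr = np.asarray(expr, dtype=float)
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(expr, axis=0).mean(axis=1)
    # Ordinal rank of each value within its own column.
    # Simplification: ties are broken by position rather than averaged.
    ranks = np.argsort(np.argsort(expr, axis=0, kind="stable"), axis=0)
    return reference[ranks]


def fisher_combine(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    Assumption: Fisher's method stands in for the unspecified
    meta-analysis rule. Each p-value must lie in (0, 1].
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # X/2, X = -2 * sum(log p)
    # Chi-square survival function with 2k degrees of freedom (closed form):
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Toy example: column 0 on a microarray-like scale, column 1 on a count-like scale.
toy = [[5.0, 400.0],
       [2.0, 100.0],
       [3.0, 300.0]]
normalized = quantile_normalize(toy)

# Combine p-values for one gene tested separately in two datasets.
combined_p = fisher_combine([0.05, 0.05])
```

After normalization both columns share the same set of values, so samples from the two platforms become directly comparable. The meta-analysis route never pools expression values across platforms and only combines the per-dataset test results.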

A Novel Bioinformatics Approach to Identify the Consistently Well-Performing Normalization Strategy for Current Metabolomic Studies

Influences of Normalization Method on Biomarker Discovery in Gas Chromatography-Mass Spectrometry-Based Untargeted Metabolomics: What Should Be Considered?

Performance Evaluation and Online Realization of Data-driven Normalization Methods Used in LC/MS based Untargeted Metabolomics Analysis

Norm ISWSVR: A Data Integration and Normalization Approach for Large-Scale Metabolomics

MetaboGroupS: A Group Entropy-Based Web Platform for Evaluating Normalization Methods in Blood Metabolomics Data from Maintenance Hemodialysis Patients.

Data normalization strategies in metabolomics: Current challenges, approaches, and tools

Group Aggregating Normalization Method for the Preprocessing of NMR-based Metabolomic Data

Comparing normalization methods and the impact of noise

Normalization Approach by a Reference Material to Improve LC-MS-Based Metabolomic Data Comparability of Multibatch Samples.

NMR Based Metabonomic Data Preprocessing

Evaluation of normalization strategies for GC-based metabolomics

Normalization and integration of large-scale metabolomics data using support vector regression

Evaluation of Normalization Methods for Analysis of LC-MS Data

Development and Validation of an Improved Probabilistic Quotient Normalization Method for LC/MS- and NMR-based Metabonomic Analysis

Analytical challenges of untargeted GC-MS-based metabolomics and the critical issues in selecting the data processing strategy

Metabolomic analysis of urine samples by UHPLC-QTOF-MS: Impact of normalization strategies

Optimal Normalization Method for GC-MS/MS-Based Large-Scale Targeted Metabolomics

Combination of Injection Volume Calibration by Creatinine and MS Signals' Normalization to Overcome Urine Variability in LC-MS-Based Metabolomics Studies

Normalization Method Utilizing Endogenous Proteins for Quantitative Proteomics

Evaluating Cross-Platform Normalization Methods for Integrated Microarray and RNA-seq Data Analysis

RobNorm: Model-Based Robust Normalization Method for Labeled Quantitative Mass Spectrometry Proteomics Data.