Evaluating Cross-Platform Normalization Methods for Integrated Microarray and RNA-seq Data Analysis
Xuejun Sun, Yu Zhang, Chuwen Liu, Xiaojing Zheng, Fei Zou
DOI: https://doi.org/10.1101/2024.09.30.615938
2024-10-02
Abstract: Integrated analysis of human gene expression data from multiple studies has become essential in genomics research for complex traits. However, integrating data generated from different cohorts with different platforms, such as microarray and RNA-seq, often requires data preprocessing, including normalization. In this study, we empirically evaluate nine commonly used cross-platform normalization methods. We classify these methods into two main types: joint and separate normalization. Joint methods normalize multiple datasets together, while separate methods normalize each dataset independently. We further divide these methods into unsupervised and supervised approaches, depending on whether they use outcomes during normalization. Quantile Normalization (QN) is an example of a joint unsupervised method, Rank-in is an example of a joint supervised method, and Training Distribution Matching (TDM) is an example of a separate unsupervised method. We assess each method's ability to cluster samples, predict outcomes, and detect differentially expressed (DE) genes using three real datasets and simulated data. First, our real data analysis suggests that while joint supervised methods cluster sample groups better than the other two method groups, they use the outcome data twice, which artificially inflates their clustering performance. This bias is further demonstrated by their inflated type I error in DE analysis and by their clustering results on simulated data with no DE genes. Second, for outcome prediction, supervised normalization is not applicable, since the outcomes being predicted cannot be used during normalization. Among the unsupervised methods, QN significantly outperforms the other approaches, regardless of whether RNA-seq data is used to predict microarray outcomes or vice versa. Finally, we compare normalization methods on downstream DE analysis using simulation. In addition to direct DE analysis on the combined normalized data using the non-parametric Wilcoxon rank-sum test, we also perform a meta-analysis that combines the p-values from the DE analysis of each individual dataset. For DE analysis, the meta-analysis consistently achieves the best balance between controlling type I error and maximizing power for DE gene detection. Our research suggests that while normalization is critical for the integrated analysis of transcriptomics data, simple QN is the most efficient and unbiased normalization approach for outcome prediction, and meta-analysis is the most appropriate approach for DE analysis.
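To make the two recommended approaches concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that joint QN means mapping every sample, from either platform, onto the mean sorted profile of all samples, and that the meta-analysis combines per-dataset p-values with Fisher's method; the abstract does not state which combination rule is used, so that choice is an assumption. The function names `quantile_normalize`, `rank_sum_pvalue`, and `fisher_combine` are hypothetical, and the rank-sum p-value uses a normal approximation without tie correction.

```python
import math

def quantile_normalize(samples):
    """Joint quantile normalization (assumed variant: mean reference profile).
    `samples` is a list of samples, each a list of values for the same genes
    in the same order. Ties are broken by gene order."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(s[k] for s in sorted_samples) / len(samples)
                 for k in range(n_genes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized

def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation
    (midranks for ties, no tie correction of the variance)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average rank of the tied block
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def fisher_combine(pvalues):
    """Fisher's method (an assumed combination rule): -2 * sum(log p) follows
    a chi-square distribution with 2k degrees of freedom under the null.
    Requires every p-value to be strictly positive."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the test statistic
    # Closed-form chi-square survival function for even degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

For a single gene, a meta-analysis in this sketch would call `rank_sum_pvalue` once per platform on that platform's own case and control values, then pass the resulting p-values to `fisher_combine`, so the two platforms are never pooled on a common scale. The direct alternative would first apply `quantile_normalize` to all samples together and then run one rank-sum test on the combined data.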
Genetics