Feature-specific quantile normalization and feature-specific mean–variance normalization deliver robust bi-directional classification and feature selection performance between microarray and RNAseq data

Daniel Skubleny, Sunita Ghosh, Jennifer Spratlin, Daniel E. Schiller, Gina R. Rayat
DOI: https://doi.org/10.1186/s12859-024-05759-w
IF: 3.307
2024-03-31
BMC Bioinformatics
Abstract: Cross-platform normalization seeks to minimize technological bias between microarray and RNAseq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings, and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method we label as Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
The paper compares two methods for normalizing gene expression data across platforms (microarray and RNA sequencing), Feature-Specific Quantile Normalization (FSQN) and Feature-Specific Mean-Variance Normalization (FSMVN), and evaluates them in supervised machine learning classification tasks.

### Research Background and Objectives

- **Research Background**: Molecular classification based on gene expression data is a powerful framework for studying, classifying, and treating disease. In cancer research, it helps in understanding tumor heterogeneity, disease mechanisms, progression, and prognosis. However, comparing data from different technological platforms (such as microarray and RNA sequencing) introduces technical bias.
- **Research Objectives**:
  - Compare the performance of FSQN and FSMVN in bidirectional normalization (microarray to RNA sequencing and RNA sequencing to microarray), and evaluate both methods when feature selection is applied.
  - Verify whether FSQN and FSMVN maintain equivalent classification performance in both normalization directions, and whether that performance is affected by feature selection.

### Main Findings

- **Elimination of Batch Effects**: FSQN and FSMVN effectively remove batch effects between data from different technological platforms.
- **Classification Performance**: Without feature selection, FSQN and FSMVN gave clinically equivalent model performance in both directions, comparable to training and testing within a single platform. Under optimal feature selection conditions, both methods also reached balanced accuracy comparable to within-platform performance.
- **Impact of Feature Selection**: With feature selection, FSQN and FSMVN maintained good performance; as the number of selected genes decreased, the performance of both methods stayed close to that obtained with single-platform data.
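For reference, balanced accuracy is conventionally defined as the mean of the per-class recalls. Assuming the paper follows this standard definition (the summary does not state it), for $K$ molecular subtypes with $TP_k$ true positives and $FN_k$ false negatives in class $k$:

$$
\text{Balanced accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}
$$

Because each class contributes equally regardless of its size, this metric is less sensitive to class imbalance than plain accuracy.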
### Conclusion

- FSQN and FSMVN are equally effective in generating supervised machine learning classifiers for molecular subtype classification.
- Under optimal modeling conditions, the model accuracy on cross-platform normalized data using these two methods is comparable to that of single-platform data.
- Caution is still needed when using cross-platform data, as specific performance differences may depend on the classification problem, training, and testing distributions, among other factors.
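The two normalization ideas compared above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fsmvn` and `fsqn`, the samples-by-genes array layout, the interpolation scheme, and the simulated data are all assumptions made for illustration. The sketch only shows the general principle that each gene (feature) in the source platform is rescaled using the distribution of the same gene in the target platform, and that either platform can play either role.

```python
import numpy as np


def fsmvn(source, target):
    """Illustrative feature-specific mean-variance normalization.

    Each column (gene) of `source` is standardized and then rescaled to
    the mean and standard deviation of the same column in `target`.
    Both arrays are samples x genes with identical gene order (assumed).
    """
    src_mean = source.mean(axis=0)
    src_sd = source.std(axis=0, ddof=1)
    tgt_mean = target.mean(axis=0)
    tgt_sd = target.std(axis=0, ddof=1)
    # Avoid division by zero for genes with no variance in the source.
    safe_sd = np.where(src_sd > 0, src_sd, 1.0)
    z_scores = (source - src_mean) / safe_sd
    return z_scores * tgt_sd + tgt_mean


def fsqn(source, target):
    """Illustrative feature-specific quantile normalization.

    For each gene, every source value is replaced by the target value at
    the matching quantile of that gene's target distribution. Ties in the
    source are broken by sample order (an assumption of this sketch).
    """
    n_src, n_genes = source.shape
    # Mid-point plotting positions for the source ranks (assumed scheme).
    probs = (np.arange(n_src) + 0.5) / n_src
    out = np.empty((n_src, n_genes), dtype=float)
    for j in range(n_genes):
        order = np.argsort(source[:, j], kind="stable")
        ranks = np.argsort(order, kind="stable")  # rank of each sample, 0-based
        target_quantiles = np.quantile(target[:, j], probs)
        out[:, j] = target_quantiles[ranks]
    return out


# Simulated stand-ins for two platforms measuring the same 4 genes.
rng = np.random.default_rng(0)
microarray = rng.normal(loc=8.0, scale=1.5, size=(50, 4))
rnaseq = rng.normal(loc=3.0, scale=4.0, size=(40, 4))

# Bidirectional use: either platform can be the target distribution.
rnaseq_as_array = fsqn(rnaseq, microarray)
array_as_rnaseq = fsmvn(microarray, rnaseq)
```

In this sketch, FSMVN matches only the first two moments of each gene, while FSQN matches the whole per-gene distribution and preserves only the rank order of the source samples. A classifier trained on the target platform would then be applied to the normalized source data.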