PCA-Plus: Enhanced principal component analysis with illustrative applications to batch effects and their quantitation
Nianxiang Zhang, Tod D Casasent, Anna K Casasent, Shwetha V Kumar, Chris Wakefield, Bradley M Broom, John N Weinstein, Rehan Akbani
DOI: https://doi.org/10.1101/2024.01.02.573793
2024-01-03
bioRxiv
Abstract: Background: Principal component analysis (PCA), a standard approach to analysis and visualization of large datasets, is commonly used in biomedical research for detecting similarities and differences among groups of samples. We initially used conventional PCA as a tool for critical quality control of batch and trend effects in multi-omic profiling data produced by The Cancer Genome Atlas (TCGA) project of the NCI. We found, however, that conventional PCA visualizations were often hard to interpret when inter-batch differences were moderate in comparison with intra-batch differences; it was also difficult to quantify batch effects objectively. We therefore sought enhancements to make the method more informative in those and analogous settings. Results: We have developed algorithms and a toolbox of enhancements to conventional PCA that improve the detection, diagnosis, and quantitation of differences between or among groups, e.g., groups of molecularly profiled biological samples. The enhancements include (i) computed group centroids; (ii) sample-dispersion rays; (iii) differential coloring of centroids, rays, and sample data points; (iv) trend trajectories; and (v) a novel separation index (DSC) for quantitation of differences among groups. Conclusions: PCA-Plus has been our most useful single tool for analyzing, visualizing, and quantitating batch effects, trend effects, and class differences in molecular profiling data of many types: mRNA expression, microRNA expression, DNA methylation, and DNA copy number. An early version of PCA-Plus has been used as the central graphical visualization in our MBatch package for near-real-time surveillance of data for analysis working groups in more than 70 TCGA, PanCancer Atlas, PanCancer Analysis of Whole Genomes, and Genome Data Analysis Network projects of the NCI. The algorithms and software are generic and hence applicable to other types of multivariate data as well.
PCA-Plus is freely available in a downloadable R package at our MBatch website.
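
To make the enhancements listed in the abstract more concrete, the following is a minimal Python sketch, not the PCA-Plus or MBatch implementation (which is distributed as an R package). It assumes the samples have already been projected onto principal components and shows how group centroids, sample-dispersion rays, and a simple separation measure could be computed from those scores. The function names (`group_centroids`, `dispersion_rays`, `separation_ratio`) and the example data are hypothetical. The abstract does not define the DSC, so the ratio of between-group to within-group dispersion used here is an assumed stand-in chosen for illustration only.

```python
from collections import defaultdict
from math import sqrt


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def group_centroids(scores, labels):
    """Return {group: centroid}, the mean PC-space position of each group."""
    sums = {}
    counts = defaultdict(int)
    for point, label in zip(scores, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
        for i, value in enumerate(point):
            sums[label][i] += value
        counts[label] += 1
    return {g: tuple(v / counts[g] for v in sums[g]) for g in sums}


def dispersion_rays(scores, labels, centroids):
    """Return one (centroid, sample) segment per sample, ready for plotting."""
    return [(centroids[label], tuple(point))
            for point, label in zip(scores, labels)]


def separation_ratio(scores, labels):
    """Illustrative between/within dispersion ratio.

    ASSUMPTION: this is a generic separation measure, not necessarily
    the DSC defined in the paper.
    """
    centroids = group_centroids(scores, labels)
    n = len(scores)
    dim = len(scores[0])
    grand_mean = tuple(sum(p[i] for p in scores) / n for i in range(dim))
    # Within-group: mean squared distance of samples from their own centroid.
    within = sum(sq_dist(p, centroids[g]) for p, g in zip(scores, labels)) / n
    # Between-group: size-weighted mean squared distance of centroids
    # from the grand mean (each sample contributes its group's centroid).
    between = sum(sq_dist(centroids[g], grand_mean) for g in labels) / n
    if within == 0:
        return float("inf")
    return sqrt(between / within)


# Hypothetical PC1/PC2 scores for four samples in two batches.
example_scores = [(0, 0), (0, 2), (10, 0), (10, 2)]
example_labels = ["A", "A", "B", "B"]
example_centroids = group_centroids(example_scores, example_labels)
example_rays = dispersion_rays(example_scores, example_labels, example_centroids)
example_ratio = separation_ratio(example_scores, example_labels)
```

In this toy example the two batches are far apart relative to their internal spread, so the ratio is large. Batches whose centroids nearly coincide would give a value close to zero, which is the kind of objective, numeric summary of a batch effect that the abstract says conventional PCA plots lack.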