Abstract: Background: Principal component analysis (PCA), a standard approach to analysis and visualization of large datasets, is commonly used in biomedical research for detecting similarities and differences among groups of samples. We initially used conventional PCA as a tool for critical quality control of batch and trend effects in multi-omic profiling data produced by The Cancer Genome Atlas (TCGA) project of the NCI. We found, however, that conventional PCA visualizations were often hard to interpret when inter-batch differences were moderate in comparison with intra-batch differences; it was also difficult to quantify batch effects objectively. We therefore sought enhancements to make the method more informative in those and analogous settings. Results: We have developed algorithms and a toolbox of enhancements to conventional PCA that improve the detection, diagnosis, and quantitation of differences between or among groups, e.g., groups of molecularly profiled biological samples. The enhancements include (i) computed group centroids; (ii) sample-dispersion rays; (iii) differential coloring of centroids, rays, and sample data points; (iv) trend trajectories; and (v) a novel separation index (DSC) for quantitation of differences among groups. Conclusions: PCA-Plus has been our most useful single tool for analyzing, visualizing, and quantitating batch effects, trend effects, and class differences in molecular profiling data of many types: mRNA expression, microRNA expression, DNA methylation, and DNA copy number. An early version of PCA-Plus has been used as the central graphical visualization in our MBatch package for near-real-time surveillance of data for analysis working groups in more than 70 TCGA, PanCancer Atlas, PanCancer Analysis of Whole Genomes, and Genome Data Analysis Network projects of the NCI. The algorithms and software are generic and hence applicable to other types of multivariate data as well.
PCA-Plus is freely available as a downloadable R package at our MBatch website.
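
To make the idea of group centroids, sample-dispersion rays, and a separation index concrete, the following is a minimal Python sketch rather than the MBatch or PCA-Plus implementation. It assumes the samples have already been projected onto principal components, and it assumes one plausible form for a dispersion-based index: the square root of the size-weighted between-group scatter divided by the square root of the within-group scatter. The abstract does not define the DSC, so that formula, the function names (`centroid`, `dispersion_rays`, `separation_index`), and the toy scores are all illustrative assumptions.

```python
from math import sqrt


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion_rays(groups):
    """For each group, pair its centroid with every sample (one ray per sample)."""
    return {g: [(centroid(pts), p) for p in pts] for g, pts in groups.items()}


def separation_index(groups):
    """Return (centroids, index) for a dict mapping group label -> list of PC scores.

    ASSUMED definition: sqrt(between-group scatter) / sqrt(within-group scatter),
    with each group weighted by its share of the samples.
    """
    all_points = [p for pts in groups.values() for p in pts]
    n_total = len(all_points)
    grand = centroid(all_points)
    centroids = {g: centroid(pts) for g, pts in groups.items()}

    # Size-weighted squared distance of each group centroid from the grand centroid.
    between = sum(
        len(pts) / n_total * sq_dist(centroids[g], grand)
        for g, pts in groups.items()
    )
    # Mean squared distance of every sample from its own group centroid.
    within = sum(
        sq_dist(p, centroids[g]) for g, pts in groups.items() for p in pts
    ) / n_total

    if within == 0:
        return centroids, float("inf")  # groups have no internal spread
    return centroids, sqrt(between) / sqrt(within)


# Hypothetical PC1/PC2 scores for two batches (made-up toy data).
scores = {
    "batch_A": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    "batch_B": [(4.0, 0.0), (6.0, 0.0), (4.0, 2.0), (6.0, 2.0)],
}
centroids, index = separation_index(scores)
rays = dispersion_rays(scores)
print(centroids)
print(round(index, 4))
```

Under this assumed definition, a larger index means the group centroids are far apart relative to the spread of samples around them, which is the situation in which a batch effect would stand out in a PCA plot. The rays are returned as (centroid, sample) pairs so that a plotting routine could draw one line segment per sample.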

VCF2PCACluster: a Simple, Fast and Memory-Efficient Tool for Principal Component Analysis of Tens of Millions of SNPs

A high-performance computing toolset for relatedness and principal component analysis of SNP data

TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes

LDBlockShow: a Fast and Convenient Tool for Visualizing Linkage Disequilibrium and Haplotype Blocks Based on Variant Call Format Files

Accelerating Sparse Canonical Correlation Analysis for Large Brain Imaging Genetics Data

Power Analysis of Principal Components Regression in Genetic Association Studies.

PCA-Plus: Enhanced principal component analysis with illustrative applications to batch effects and their quantitation

SHEsisPCA: A GPU-Based Software to Correct for Population Stratification That Efficiently Accelerates the Process for Handling Genome-Wide Datasets

Establishment of a standardized system to perform population structure analyses with limited sample size or with different sets of SNP genotypes

A PCA-based method for ancestral informative markers selection in structured populations

PCA Outperforms Popular Hidden Variable Inference Methods for Molecular QTL Mapping

VariantSpark: population scale clustering of genotype information

Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Principal Component Analyses in Anthropological Genetics

PAcluster: Clustering Polyadenylation Site Data Using Canonical Correlation Analysis

Population Clustering Based on Copy Number Variations Detected from Next Generation Sequencing Data.

Clustermi: Detecting High-Order Snp Interactions Based On Clustering And Mutual Information

VPAC: Variational Projection for Accurate Clustering of Single-Cell Transcriptomic Data

Power Calculation of Multi-Step Combined Principal Components with Applications to Genetic Association Studies

rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated tool for Genome-Wide Association Study

A portable clustering algorithm based on compact neighbors for face tagging