On stability of Canonical Correlation Analysis and Partial Least Squares with application to brain-behavior associations

Markus Helmer, Shaun Warrington, Ali-Reza Mohammadi-Nejad, Jie Lisa Ji, Amber Howell, Benjamin Rosand, Alan Anticevic, Stamatios N. Sotiropoulos, John D. Murray
DOI: https://doi.org/10.1101/2020.08.25.265546
2020-08-25
Abstract: Associations between datasets, each comprising many features, can be discovered through multivariate methods like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). Application of CCA/PLS to high-dimensional datasets raises critical questions about reliability and interpretability. To study this, we developed a generative modeling framework to simulate synthetic datasets, parameterized by dimensionality, variance structure, and association strength. We found that CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, profiles of feature weights exhibit detrimental bias toward leading principal component axes. We confirmed these trends in state-of-the-art neuroimaging datasets, Human Connectome Project (n ≈ 1000) and UK Biobank (n = 20000), finding that only the latter comprised sufficient samples for stable estimates. Analysis of the neuroimaging literature using CCA to map brain-behavior relationships also revealed that the commonly employed sample sizes yield unstable CCA solutions. Finally, we provide a calculator of dataset properties required for CCA/PLS stability. Collectively, we characterize how to limit detrimental effects of overfitting on CCA/PLS stability, and provide recommendations for future studies.
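
The effect of a small number of samples per feature on CCA can be illustrated with a minimal sketch. This is not the authors' generative framework or their calculator: the function names `first_canonical_correlation` and `simulate`, the choice of 10 features per dataset, and the true correlation of 0.3 are all assumptions made for illustration. The sketch plants a single correlated feature pair in otherwise independent Gaussian data and reports the in-sample first canonical correlation, computed from the singular values of the product of the orthonormal bases of the two centered datasets.

```python
import numpy as np


def first_canonical_correlation(X, Y):
    """Largest in-sample canonical correlation between X and Y (rows are samples)."""
    # Center each feature, then take orthonormal bases of the column spaces.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))


def simulate(n_samples, n_features=10, true_r=0.3, seed=0):
    """Hypothetical toy model: one feature pair correlated at true_r, the rest noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    Y = rng.standard_normal((n_samples, n_features))
    # Mix X's first feature into Y's first feature so their correlation is true_r.
    Y[:, 0] = true_r * X[:, 0] + np.sqrt(1.0 - true_r**2) * Y[:, 0]
    return first_canonical_correlation(X, Y)


if __name__ == "__main__":
    for n in (50, 500, 20000):
        print(f"n = {n:>5}: estimated first canonical correlation = {simulate(n):.3f}")
```

Under these assumptions, the estimate at small n is inflated well above the planted value of 0.3 and approaches it only as the number of samples per feature grows, which is the overfitting pattern the abstract describes.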