Abstract: Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis decomposes each view's data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). D-GCCA rigorously defines the decomposition on the L2 space of random variables, in contrast to the Euclidean dot product space used by most existing methods, and is thereby able to provide estimation consistency for the low-rank matrix recovery. Moreover, to calibrate the common latent factors well, we impose a desirable orthogonality constraint on the distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer a substantial loss of undetected common-source variation. D-GCCA goes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose using the variable-level proportion of signal variance explained by common or distinctive latent factors to select the variables most influenced by those factors. Consistent estimators for D-GCCA are established; they show good finite-sample numerical performance and have closed-form expressions, leading to efficient computation, especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
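
To make the three-part model in the abstract concrete, the following minimal sketch simulates multi-view data in which each view is the sum of a low-rank common-source matrix, a low-rank distinctive-source matrix, and additive noise. It illustrates the data model only and is not the D-GCCA estimator. The function name `simulate_views`, the Gaussian factors and loadings, and all dimensions are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def simulate_views(n=200, dims=(30, 40), r_common=2, r_distinct=(1, 2),
                   noise_sd=0.5, seed=0):
    """Simulate views X_k = C_k + D_k + E_k (hypothetical illustration).

    Rows are variables and columns are the n shared objects.
    All distributions and sizes are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Latent factors shared by every view (these drive the common-source matrices).
    common_factors = rng.standard_normal((r_common, n))
    views = []
    for p_k, r_k in zip(dims, r_distinct):
        # View-specific latent factors, drawn independently for each view.
        distinct_factors = rng.standard_normal((r_k, n))
        C = rng.standard_normal((p_k, r_common)) @ common_factors   # common-source part
        D = rng.standard_normal((p_k, r_k)) @ distinct_factors      # distinctive-source part
        E = noise_sd * rng.standard_normal((p_k, n))                # additive noise
        views.append({"X": C + D + E, "C": C, "D": D, "E": E})
    return views

views = simulate_views()
for k, v in enumerate(views):
    print(f"view {k}: X shape {v['X'].shape}, "
          f"rank(C) = {np.linalg.matrix_rank(v['C'])}, "
          f"rank(D) = {np.linalg.matrix_rank(v['D'])}")
```

In this sketch the distinctive factors are drawn independently across views, which only loosely mirrors the orthogonality constraint the abstract describes. A decomposition method such as D-GCCA would receive only the observed `X` matrices and aim to recover the `C` and `D` parts.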

Discovering Structure in High-Dimensional Data Through Correlation Explanation

Finding High-Order Correlations In High-Dimensional Biological Data

Efficient Covariance Estimation from Temporal Data

Exploring higher-order neural network node interactions with total correlation

An Efficient Algorithm for Information Decomposition and Extraction

Inferring Local Structure from Pairwise Correlations

Auto-Encoding Total Correlation Explanation

ExClus: Explainable Clustering on Low-dimensional Data Representations

CARE: Finding Local Linear Correlations in High Dimensional Data

Intrinsic Dimension Correlation: uncovering nonlinear connections in multimodal representations

Correlations reveal the hierarchical organization of biological networks with latent variables

Identifying the Complete Correlation Structure in Large-Scale High-Dimensional Data Sets with Local False Discovery Rates

Correlated Components Analysis - Extracting Reliable Dimensions in Multivariate Data

Discovering and Deciphering Relationships Across Disparate Data Modalities

Unsupervised detection of semantic correlations in big data

Manifold-based Shapley explanations for high dimensional correlated features

A New Approach to Discover Interlacing Data Structures in High-Dimensional Space

Telling cause from effect based on high-dimensional observations

Big Data Scaling through Metric Mapping: Exploiting the Remarkable Simplicity of Very High Dimensional Spaces using Correspondence Analysis

D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data

Discovering Support and Affiliated Features from Very High Dimensions