Abstract: Integrative approaches that simultaneously model multi-omics data have gained increasing popularity because they provide holistic systems-biology views of multiple or all components in a biological system of interest. Canonical correlation analysis (CCA) is a correlation-based integrative method designed to extract latent features shared between multiple assays by finding the linear combinations of features within each assay, referred to as canonical variables (CVs), that achieve maximal across-assay correlation. Although widely acknowledged as a powerful approach for multi-omics data, CCA has not been systematically applied to multi-omics data from large cohort studies, which have only recently become available. Here, we adapted sparse multiple CCA (SMCCA), a widely used derivative of CCA, to proteomics and methylomics data from the Multi-Ethnic Study of Atherosclerosis (MESA) and Jackson Heart Study (JHS). To tackle challenges encountered when applying SMCCA to MESA and JHS, our adaptations include the incorporation of the Gram-Schmidt (GS) algorithm into SMCCA to improve orthogonality among CVs, and the development of Sparse Supervised Multiple CCA (SSMCCA) to allow supervised integration analysis for more than two assays. Effective application of SMCCA to the two real datasets reveals important findings. Applying our SMCCA-GS to MESA and JHS, we identified strong associations between blood cell counts and protein abundance, suggesting that adjustment for blood cell composition should be considered in protein-based association studies. Importantly, CVs obtained from the two independent cohorts also demonstrate transferability across the cohorts. For example, proteomic CVs learned from JHS, when transferred to MESA, explain similar amounts of blood cell count phenotypic variance in both cohorts: 39.0% to 50.0% of the variation in JHS and 38.9% to 49.1% in MESA. Similar transferability was observed for other omics-CV-trait pairs.
This transferability suggests that biologically meaningful and cohort-agnostic variation is captured by the CVs. We anticipate that applying our SMCCA-GS and SSMCCA to various cohorts would help identify cohort-agnostic, biologically meaningful relationships between multi-omics data and phenotypic traits.

Comprehensive understanding of human complex traits may benefit from the incorporation of molecular features from multiple biological layers such as the genome, epigenome, transcriptome, proteome, and metabolome. CCA is a correlation-based method for multi-omics data that reduces the dimension of each omic assay to several orthogonal components, commonly referred to as canonical variables (CVs). The widely used SMCCA method allows effective dimension reduction and integration of multi-omics data, but can produce highly correlated CVs when applied to high-dimensional omics data. Here, we improve the statistical independence among the CVs by adopting a variation of the GS algorithm. We applied our SMCCA-GS method to proteomic and methylomic data from two cohort studies, MESA and JHS. Our results reveal a pronounced effect of blood cell counts on protein abundance, suggesting that adjustment for blood cell composition may be necessary in protein-based association studies. Finally, we present SSMCCA, which allows supervised CCA analysis of the association between one phenotype of interest and more than two assays. We anticipate that SMCCA-GS would help reveal meaningful system-level factors from biological processes involving features from multiple assays, and that SSMCCA would further empower interrogation of these factors for phenotypic traits related to health and disease.
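To illustrate why orthogonalizing CVs helps, the following is a minimal sketch of Gram-Schmidt applied to a matrix of CV scores. It is not the authors' SMCCA-GS implementation: the text above only says that "a variation of the GS algorithm" is used, so the textbook modified Gram-Schmidt procedure is swapped in here. The function name `gram_schmidt_cvs`, the simulated data, and all parameters are assumptions made for illustration.

```python
import numpy as np


def gram_schmidt_cvs(scores, tol=1e-12):
    """Orthogonalize the columns of an (n_samples, n_cvs) score matrix.

    Hypothetical helper: column k is replaced by its residual after
    projecting out columns 0..k-1 (modified Gram-Schmidt). Columns are
    centered first, so orthogonal columns are also uncorrelated.
    """
    Z = np.asarray(scores, dtype=float)
    Z = Z - Z.mean(axis=0)
    Q = np.zeros_like(Z)
    for k in range(Z.shape[1]):
        v = Z[:, k].copy()
        for j in range(k):
            denom = Q[:, j] @ Q[:, j]
            if denom > tol:  # skip directions that are numerically zero
                v -= (Q[:, j] @ v) / denom * Q[:, j]
        Q[:, k] = v
    return Q


# Simulated example (assumption): three CVs that all load on one shared
# factor, mimicking highly correlated CVs from high-dimensional data.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
cvs = shared + 0.5 * rng.normal(size=(200, 3))

ortho = gram_schmidt_cvs(cvs)

# Largest absolute off-diagonal correlation before and after
before = np.corrcoef(cvs, rowvar=False)
after = np.corrcoef(ortho, rowvar=False)
max_off_before = np.abs(before - np.eye(3)).max()
max_off_after = np.abs(after - np.eye(3)).max()
print(f"max |corr| before: {max_off_before:.3f}, after: {max_off_after:.2e}")
```

In this sketch the first CV is kept as is (apart from centering) and each later CV keeps only the variation not already explained by earlier ones, so the order of the CVs matters. How the published method interleaves this step with the sparse CCA optimization is not described in the text above.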

Complete canonical correlation analysis for multi-omic molecular subtyping of colorectal cancer

Multi-Omics Data Fusion for Cancer Molecular Subtyping Using Sparse Canonical Correlation Analysis

Multiomics-Based Colorectal Cancer Molecular Subtyping Using Local Scaling Network Fusion

Benchmarking multi-omics integrative clustering methods for subtype identification in colorectal cancer

Comprehensive characterization of tumor microenvironment in colorectal cancer via molecular analysis

Multi-omics characterization of cholangiocarcinoma and association with prognostic and therapeutic molecular subtypes.

Multi-omics cluster defines the subtypes of CRC with distinct prognosis and tumor microenvironment

Multi-omics Analysis Classifies Colorectal Cancer into Distinct Methylated Immunogenic and Angiogenic Subtypes Based on Anatomical Laterality

Abstract 409: Deciphering chromosomal instability in consensus molecular subtypes (CMS) in CRC: Insights from an integrative multi-omics approach

Classification of Colorectal Cancer Consensus Molecular Subtypes Using Attention-Based Multi-Instance Learning Network on Whole-Slide Images

Subtype Identification from Heterogeneous TCGA Datasets on a Genomic Scale by Multi-View Clustering with Enhanced Consensus.

Metabolism-Associated Molecular Classification of Colorectal Cancer

MDICC: novel method for multi-omics data integration and cancer subtype identification

Identification of immunotherapy and chemotherapy-related molecular subtypes in colon cancer by integrated multi-omics data analysis

Sparse canonical correlation analysis applied to ‐omics studies for integrative analysis and biomarker discovery

Canonical correlation analysis for multi-omics: Application to cross-cohort analysis

A novel molecular subtyping based on multi-omics analysis for prognosis predicting in colorectal melanoma: A 16-year prospective multicentric study

Correlating Cellular Features with Gene Expression using CCA

Multi-view contrastive clustering for cancer subtyping using fully and weakly paired multi-omics data

A transcriptome based molecular classification scheme for cholangiocarcinoma and subtype-derived prognostic biomarker

MRGCN: cancer subtyping with multi-reconstruction graph convolutional network using full and partial multi-omics dataset