Abstract: Graphical model estimation from modern multi-omics data requires a balance between statistical estimation performance and computational scalability. We introduce a novel pseudolikelihood-based graphical model framework that reparameterizes the target precision matrix while preserving its sparsity pattern and estimates it by minimizing an $\ell_1$-penalized empirical risk based on a new loss function. The proposed estimator maintains estimation and selection consistency in various metrics under high-dimensional assumptions. The associated optimization problem allows for a provably fast computation algorithm using a novel operator-splitting approach and communication-avoiding distributed matrix multiplication. A high-performance computing implementation of our framework was tested on simulated data with up to one million variables demonstrating complex dependency structures akin to biological networks. Leveraging this scalability, we estimated a partial correlation network from a dual-omic liver cancer data set. The co-expression network estimated from the ultrahigh-dimensional data showed superior specificity in prioritizing key transcription factors and co-activators by excluding the impact of epigenomic regulation, demonstrating the value of computational scalability in multi-omic data analysis.
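
As a rough illustration of the $\ell_1$-penalized estimation mentioned in the abstract, the sketch below shows soft-thresholding, the proximal operator of the $\ell_1$ penalty that operator-splitting solvers for such problems commonly apply at each iteration. It is a generic building block and not the paper's algorithm; the function name `soft_threshold`, the example matrix, and the threshold value are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1, applied elementwise.

    Entries with magnitude at most tau are set to zero; the rest are
    shrunk toward zero by tau. This is what produces sparsity.
    """
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


# Hypothetical values: candidate entries of a matrix iterate.
candidate = np.array([[1.5, -0.2],
                      [0.05, -2.0]])
shrunk = soft_threshold(candidate, tau=0.3)
print(shrunk)
```

Small entries are zeroed out while large ones are only shrunk, which is how an $\ell_1$ penalty yields a sparse estimate.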

What problem does this paper attempt to address?

This paper addresses the trade-off between statistical estimation performance and computational scalability when constructing large-scale partial correlation networks in clinical multi-omics studies. Specifically, the article points out that current methods (such as Graphical Lasso and CLIME) hit computational bottlenecks on modern multi-omics data and cannot effectively handle extremely large datasets. To solve this problem, the authors propose ACCORD, a new pseudolikelihood-based graphical model framework. ACCORD reparameterizes the target precision matrix and estimates it by minimizing an ℓ1-penalized empirical risk based on a new loss function. The framework maintains estimation and selection consistency, and it admits a fast, scalable optimization algorithm built on a novel operator-splitting method and communication-avoiding distributed matrix multiplication.

The key problems the paper attempts to solve are:

1. **Computational Scalability**: Existing methods face computational bottlenecks on extremely large datasets and cannot complete the computation within a reasonable time. For example, in two high-performance computing environments, the most scalable Graphical Lasso implementation, BigQUIC, can only complete the computation in simple cases, and CLIME cannot run on HPC machines when there are more than 30,000 variables.
2. **Statistical Estimation Performance**: Existing high-dimensional precision matrix estimation methods (such as Graphical Lasso) perform well on small datasets but struggle to guarantee accurate and consistent estimates on extremely large ones. In addition, pseudolikelihood methods such as CONCORD have good computational performance but lack statistical consistency.
3. **Limitations of Feature Screening**: To improve computational efficiency, some methods adopt a feature screening step, but this may exclude important molecules and thereby affect the biological interpretation of the results.
4. **Estimation of Conditional Dependence Relationships**: In multi-omics data analysis, marginal correlations may not reflect direct biological interactions but may instead be caused by shared regulatory factors. Estimating conditional dependence relationships (such as partial correlations) is therefore crucial for revealing the real biological network structure.

To solve the above problems, the authors propose the ACCORD framework, which balances computational scalability and statistical performance in the following ways:

- **Reparameterizing the Precision Matrix**: By reparameterizing the target precision matrix and introducing a new loss function, ACCORD makes the optimization problem efficiently solvable on large datasets.
- **Fast Optimization Algorithm**: ACCORD uses an operator-splitting method and communication-avoiding distributed matrix multiplication to obtain an efficient optimization algorithm.
- **Statistical Consistency**: The ACCORD estimator is consistent under standard high-dimensional assumptions and is selection consistent under an irrepresentability condition.

Finally, the authors apply HP-ACCORD to simulated data and to real liver cancer multi-omics data, demonstrating its effectiveness and its advantage in handling extremely large datasets.
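
The point about conditional dependence can be made concrete with a small simulation. The sketch below is not the ACCORD estimator: it works in a low-dimensional setting with many more samples than variables, inverts the sample covariance directly, and applies the standard identity $\rho_{ij} = -\omega_{ij} / \sqrt{\omega_{ii}\,\omega_{jj}}$ to the resulting precision matrix. The variable names (`regulator`, `gene_a`, `gene_b`), the noise scale, and the sample size are assumptions made for illustration.

```python
import numpy as np


def partial_correlations(precision):
    """Convert a precision matrix into a partial correlation matrix."""
    d = np.sqrt(np.diag(precision))
    rho = -precision / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho


rng = np.random.default_rng(0)
n = 20000  # hypothetical sample size, much larger than the 3 variables

# Two "genes" driven by one shared regulator, with no direct link between them.
regulator = rng.normal(size=n)
gene_a = regulator + 0.5 * rng.normal(size=n)
gene_b = regulator + 0.5 * rng.normal(size=n)
data = np.column_stack([regulator, gene_a, gene_b])

marginal = np.corrcoef(data, rowvar=False)
precision = np.linalg.inv(np.cov(data, rowvar=False))
partial = partial_correlations(precision)

print("marginal corr(gene_a, gene_b):", round(marginal[1, 2], 3))
print("partial  corr(gene_a, gene_b):", round(partial[1, 2], 3))
```

In this toy setup the two genes are strongly correlated marginally, yet their partial correlation given the regulator is close to zero. Direct inversion is only feasible here because the problem is tiny; when variables far outnumber samples the sample covariance is singular, which is the regime that penalized estimators such as the one described above are designed for.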

Learning Massive-scale Partial Correlation Networks in Clinical Multi-omics Studies with HP-ACCORD

A Generalized Higher-order Correlation Analysis Framework for Multi-Omics Network Inference

MaxCorrMGNN: A Multi-Graph Neural Network Framework for Generalized Multimodal Fusion of Medical Data for Outcome Prediction

Bayesian estimation for longitudinal data in a joint model with HPCs

SDGCCA: Supervised Deep Generalized Canonical Correlation Analysis for Multi-omics Integration

Integration of Multi-Omics Data for Gene Regulatory Network Inference and Application to Breast Cancer

An Efficient and Principled Model to Jointly Learn the Agnostic and Multifactorial Effect in Large-Scale Biological Data

Biological network inference using low order partial correlation

Covariance Assisted Multivariate Penalized Additive Regression (CoMPAdRe)

Nonparametric Covariance Regression for Massive Neural Data on Restricted Covariates via Graph

MMGCN: Multi-modal multi-view graph convolutional networks for cancer prognosis prediction

Joint network and node selection for pathway-based genomic data analysis.

Prediction of disease-free survival for precision medicine using cooperative learning on multi-omic data

High-Dimensional Joint Estimation of Multiple Directed Gaussian Graphical Models

NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA.

Robust Multi-view Co-expression Network Inference

Unsupervised discovery of phenotype-specific multi-omics networks

SNeCT: Scalable network constrained Tucker decomposition for integrative multi-platform data analysis

Multi-level attention graph neural network based on co-expression gene modules for disease diagnosis and prognosis

The joint graphical lasso for inverse covariance estimation across multiple classes

Modelling-based joint embedding of histology and genomics using canonical correlation analysis for breast cancer survival prediction