Learning Massive-scale Partial Correlation Networks in Clinical Multi-omics Studies with HP-ACCORD

Sungdong Lee, Joshua Bang, Youngrae Kim, Hyungwon Choi, Sang-Yun Oh, Joong-Ho Won
2024-12-16
Abstract: Graphical model estimation from modern multi-omics data requires a balance between statistical estimation performance and computational scalability. We introduce a novel pseudolikelihood-based graphical model framework that reparameterizes the target precision matrix while preserving its sparsity pattern and estimates it by minimizing an $\ell_1$-penalized empirical risk based on a new loss function. The proposed estimator maintains estimation and selection consistency in various metrics under high-dimensional assumptions. The associated optimization problem allows for a provably fast computation algorithm using a novel operator-splitting approach and communication-avoiding distributed matrix multiplication. A high-performance computing implementation of our framework was tested in simulated data with up to one million variables demonstrating complex dependency structures akin to biological networks. Leveraging this scalability, we estimated a partial correlation network from a dual-omic liver cancer data set. The co-expression network estimated from the ultrahigh-dimensional data showed superior specificity in prioritizing key transcription factors and co-activators by excluding the impact of epigenomic regulation, demonstrating the value of computational scalability in multi-omic data analysis.
Machine Learning, Statistics Theory
What problem does this paper attempt to address?
This paper addresses the trade-off between statistical estimation performance and computational scalability when constructing large-scale partial correlation networks in clinical multi-omics studies. Specifically, it points out that current methods (such as Graphical Lasso and CLIME) hit computational bottlenecks on modern multi-omics data and cannot effectively handle extremely large datasets. To address this, the authors propose a new pseudolikelihood-based graphical model framework, ACCORD. The framework reparameterizes the target precision matrix and estimates it by minimizing an ℓ1-penalized empirical risk based on a new loss function. ACCORD maintains estimation and selection consistency, and it admits a fast, scalable optimization algorithm built on a novel operator-splitting method and communication-avoiding distributed matrix multiplication.

The key problems the paper attempts to solve are:

1. **Computational Scalability**: Existing methods face computational bottlenecks on extremely large datasets and cannot complete the computation within a reasonable time. For example, in two high-performance computing environments, the most scalable Graphical Lasso implementation, BigQUIC, can only complete the computation in simple cases, and CLIME cannot run on HPC machines when there are more than 30,000 variables.
2. **Statistical Estimation Performance**: Existing high-dimensional precision matrix estimation methods (such as Graphical Lasso) perform well on small datasets, but it is difficult to ensure the accuracy and consistency of their estimates on extremely large datasets. In addition, pseudolikelihood methods such as CONCORD have good computational performance but lack statistical consistency.
3. **Limitations of Feature Screening**: To improve computational efficiency, some methods adopt a feature screening step, but this may exclude important molecules and thus affect the biological interpretation of the results.
4. **Estimation of Conditional Dependence Relationships**: In multi-omics data analysis, marginal correlations may not reflect direct biological interactions but may instead be caused by shared regulatory factors. Estimating conditional dependence relationships (such as partial correlations) is therefore crucial for revealing the real biological network structure.

To solve these problems, the ACCORD framework balances computational scalability and statistical performance in the following ways:

- **Reparameterizing the Precision Matrix**: By reparameterizing the target precision matrix and introducing a new loss function, ACCORD allows the optimization problem to be solved efficiently on large datasets.
- **Fast Optimization Algorithm**: ACCORD uses an operator-splitting method and communication-avoiding distributed matrix multiplication to obtain an efficient optimization algorithm.
- **Statistical Consistency**: The ACCORD estimator is consistent under standard high-dimensional assumptions and is selection consistent under an irrepresentability condition.

Finally, the authors apply HP-ACCORD to simulated data and to real liver cancer multi-omics data, demonstrating its effectiveness and advantages in handling extremely large datasets.
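
Point 4 above rests on the difference between marginal and partial correlation, which a small numerical example can make concrete. The sketch below is not the ACCORD estimator. It uses a hypothetical three-variable toy model, in which a shared regulator `z` drives `x` and `y` with no direct link between them, and plain matrix inversion of the sample covariance. The only formula it relies on is the standard identity that the partial correlation between variables i and j equals -ω_ij / sqrt(ω_ii ω_jj), where ω denotes the entries of the precision matrix. The function name `partial_correlations` and all simulation parameters are assumptions chosen for illustration.

```python
import numpy as np


def partial_correlations(precision):
    """Convert a precision matrix into a partial correlation matrix.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j.
    """
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)  # by convention, each variable has value 1 with itself
    return pcor


# Hypothetical toy data: z is a shared regulator of x and y.
rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)
x = z + 0.5 * rng.standard_normal(n)  # x depends on z only
y = z + 0.5 * rng.standard_normal(n)  # y depends on z only
data = np.column_stack([x, y, z])

# Marginal correlation: x and y look strongly related.
marginal = np.corrcoef(data, rowvar=False)

# Partial correlation: invert the sample covariance (feasible only because p = 3).
precision = np.linalg.inv(np.cov(data, rowvar=False))
partial = partial_correlations(precision)

print("marginal corr(x, y):   ", round(marginal[0, 1], 3))
print("partial corr(x, y | z):", round(partial[0, 1], 3))
```

In this toy model the marginal correlation between `x` and `y` is large, while their partial correlation given `z` is close to zero, which is the pattern the summary attributes to shared regulatory factors. Direct inversion works here only because there are three variables and thousands of samples. When the number of variables exceeds the number of samples, the sample covariance matrix is singular, which is one reason penalized estimators of the kind summarized above are used instead.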