An Efficient and Principled Model to Jointly Learn the Agnostic and Multifactorial Effect in Large-Scale Biological Data
Zuolin Cheng, Songtao Wei, Yinxue Wang, Yizhi Wang, Q Richard Lu, Yue Wang, Guoqiang Yu
DOI: https://doi.org/10.1101/2024.04.12.589306
2024-04-15
Abstract: The rich information contained in biological data is often distorted by multiple interacting intrinsic or extrinsic factors. Modeling the effects of these factors is necessary to uncover the true underlying signals. This is challenging in real applications, because biological data usually involve tens of thousands or even millions of factors, and no reliable prior knowledge is available on how these factors exert their effects, how strong those effects are, or how the factors interact with each other. Existing approaches therefore rely on excessive simplification or unrealistic assumptions, such as probabilistic independence among factors. In this paper, we report that, once the data are reformulated as a contingency tensor, the problem can be well addressed by a fundamental machine learning principle, Maximum Entropy, with the additional effort of designing an efficient algorithm to solve the resulting large-scale optimization problem. Based on the principle of maximum entropy, and by constraining the marginals of the contingency tensor to the observed values, our Conditional Multifactorial Contingency (CMC) model imposes minimal but essential assumptions about the multifactorial joint effects and leads to a conceptually simple distribution, which describes how these factors exert their effects and interact with each other. By replacing hard constraints on the marginals with constraints on their expected values, CMC avoids an NP-hard problem and yields a theoretically solvable convex problem. However, because of the large number of variables and constraints, standard convex solvers do not work. Exploiting the special properties of the CMC model, we developed an efficient iterative optimizer, which reduces the running time from infeasible to minutes, or from days to seconds.
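To make the idea of constraining expected marginals concrete, the following is a minimal sketch for the simplest possible case: a two-factor, binary contingency table. It is not the CMC optimizer and makes several assumptions that are not stated in the abstract: the table is binary, only row and column sums are constrained, and the fit uses plain gradient ascent on the concave dual. Under those assumptions, the maximum-entropy distribution is known to factorize into independent Bernoulli entries with P[i, j] = sigmoid(a[i] + b[j]). The function name `fit_expected_marginals` and the example table are hypothetical.

```python
import numpy as np


def fit_expected_marginals(X, n_iter=2000, tol=1e-10):
    """Illustrative two-factor maximum-entropy fit (not the authors' solver).

    Finds P[i, j] = sigmoid(a[i] + b[j]) whose expected row and column sums
    match the observed row and column sums of the binary table X.
    Assumes no row or column of X is all zeros or all ones; otherwise the
    parameters diverge.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    row_sums = X.sum(axis=1)
    col_sums = X.sum(axis=0)

    a = np.zeros(n_rows)  # one parameter per row factor
    b = np.zeros(n_cols)  # one parameter per column factor

    # Conservative step size: each Bernoulli variance p(1-p) is at most 1/4,
    # so the gradient's Lipschitz constant is at most max(n_rows, n_cols) / 2.
    step = 2.0 / max(n_rows, n_cols)

    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))
        # Gradient of the dual: observed marginals minus expected marginals.
        grad_a = row_sums - P.sum(axis=1)
        grad_b = col_sums - P.sum(axis=0)
        if max(np.abs(grad_a).max(), np.abs(grad_b).max()) < tol:
            break
        a += step * grad_a
        b += step * grad_b

    P = 1.0 / (1.0 + np.exp(-(a[:, None] + b[None, :])))
    return P, a, b


# Hypothetical 5 x 4 binary table (e.g. rows and columns as two factors).
X_example = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])
P_example, a_example, b_example = fit_expected_marginals(X_example)
```

In this toy setting, each entry of `P_example` is the probability of observing a 1 given the marginals alone, so observed entries that disagree strongly with it are the ones the marginal effects do not explain. Extending this to a tensor with many factors is where a general-purpose solver becomes impractical, which is the scaling problem the abstract refers to.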
We applied CMC to several cutting-edge biological applications, including the detection of driving transcription factors, scRNA-seq normalization, cancer-associated gene identification, GO-term activity transformation, and quantification of single-cell-level similarity. CMC performed much better than other methods with respect to various evaluation criteria. The CMC source code and its example applications can be found at .
Bioinformatics