Semi-supervised Cooperative Learning for Multiomics Data Fusion

Daisy Yi Ding, Xiaotao Shen, Michael Snyder, Robert Tibshirani
2023-08-03
Abstract: Multiomics data fusion integrates diverse data modalities, ranging from transcriptomics to proteomics, to gain a comprehensive understanding of biological systems and enhance predictions on outcomes of interest related to disease phenotypes and treatment responses. Cooperative learning, a recently proposed method, unifies the commonly-used fusion approaches, including early and late fusion, and offers a systematic framework for leveraging the shared underlying relationships across omics to strengthen signals. However, the challenge of acquiring large-scale labeled data remains, and there are cases where multiomics data are available but in the absence of annotated labels. To harness the potential of unlabeled multiomics data, we introduce semi-supervised cooperative learning. By utilizing an "agreement penalty", our method incorporates the additional unlabeled data in the learning process and achieves consistently superior predictive performance on simulated data and a real multiomics study of aging. It offers an effective solution to multiomics data fusion in settings with both labeled and unlabeled data and maximizes the utility of available data resources, with the potential of significantly improving predictive models for diagnostics and therapeutics in an increasingly multiomics world.
Quantitative Methods, Genomics, Applications
What problem does this paper attempt to address?
The paper addresses multiomics data fusion and proposes a new method, semi-supervised cooperative learning, that uses unlabeled data to improve prediction performance when labeled data are limited. Specifically, it covers the following points:
1. **Challenges of Multiomics Data Fusion**: Advances in biotechnology make it possible to collect several types of "omics" data (genomics, transcriptomics, proteomics, and others), each offering a different view of a biological system. Analyzing these sources jointly can improve the prediction of disease phenotypes and treatment responses.
2. **Limitations of Existing Fusion Methods**: The common approaches, early fusion and late fusion, do not fully exploit the relationships shared across data modalities and offer no systematic way to strengthen the signals the modalities have in common.
3. **Cooperative Learning Methods**: The recently proposed cooperative learning method adds an "agreement penalty" term that encourages the predictions from different data modalities to agree, which strengthens shared signals and improves prediction performance.
4. **Scarcity of Labeled Data**: In biomedical research, large-scale labeled data are difficult and time-consuming to obtain, so making effective use of samples without labels becomes particularly important.
5. **Semi-Supervised Cooperative Learning Method**: The paper extends cooperative learning so that the agreement penalty is applied to unlabeled samples as well as labeled ones. The model therefore learns from prediction agreement across modalities on all available samples, not only those with labels.
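The idea in items 3 and 5 can be sketched for a linear model with two data views. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ridge term `lam`, and the synthetic data are all hypothetical, and the objective follows the least-squares form usually given for cooperative learning, with the agreement penalty from the abstract evaluated on labeled and unlabeled rows together.

```python
import numpy as np


def fit_semi_supervised_coop(X, Z, y, X_unl, Z_unl, rho=1.0, lam=1e-3):
    """Linear two-view sketch (hypothetical helper, not the authors' code).

    Minimizes
        0.5 * ||y - X a - Z b||^2                 (fit on labeled rows)
      + 0.5 * rho * ||X_all a - Z_all b||^2       (agreement penalty)
      + 0.5 * lam * (||a||^2 + ||b||^2)           (ridge term, an assumption)
    where X_all and Z_all stack the labeled and unlabeled rows.
    """
    X_all = np.vstack([X, X_unl])
    Z_all = np.vstack([Z, Z_unl])
    s = np.sqrt(rho)
    # Augmented least squares: the first block fits the labels, the second
    # block pushes the two views' predictions toward each other (target 0).
    A = np.vstack([np.hstack([X, Z]), np.hstack([-s * X_all, s * Z_all])])
    t = np.concatenate([y, np.zeros(X_all.shape[0])])
    # Ridge-regularized normal equations.
    coef = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ t)
    p = X.shape[1]
    return coef[:p], coef[p:]


def disagreement(a, b, X_all, Z_all):
    """Mean squared gap between the two views' predictions."""
    return float(np.mean((X_all @ a - Z_all @ b) ** 2))


# Synthetic two-view data driven by one shared latent signal (an assumption).
rng = np.random.default_rng(0)
n_lab, n_unl, p = 30, 200, 5
n = n_lab + n_unl
u = rng.normal(size=n)
X_all = np.outer(u, rng.normal(size=p)) + rng.normal(size=(n, p))
Z_all = np.outer(u, rng.normal(size=p)) + rng.normal(size=(n, p))
y = u[:n_lab] + 0.5 * rng.normal(size=n_lab)

a, b = fit_semi_supervised_coop(
    X_all[:n_lab], Z_all[:n_lab], y, X_all[n_lab:], Z_all[n_lab:], rho=1.0
)
# The fitted prediction is the sum of the two views' contributions.
y_hat = X_all[:n_lab] @ a + Z_all[:n_lab] @ b
```

In this sketch, `rho=0` makes the unlabeled rows irrelevant and the fit reduces to ridge regression on the concatenated features, which corresponds to early fusion. Larger values of `rho` weight agreement between the views more heavily, and that is the only channel through which the unlabeled samples influence the coefficients.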
In summary, the main contribution of this paper is the proposal of a new semi-supervised learning framework that, under conditions of limited labeled data, improves the prediction performance of multiomics data fusion by fully utilizing unlabeled data. This has significant implications for biomedical research, particularly in the fields of diagnosis and treatment.