Abstract: Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without "negative transfer" effects often seen in existing multi-task learning and transfer learning methods.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to effectively utilize data from different sources or domains in cancer subtype discovery, especially when the number of data samples for a specific cancer type is small, to improve the accuracy of disease prognosis based on next-generation sequencing (NGS) count data. Specifically, the paper proposes a Bayesian Multi-Domain Learning (BMDL) model, aiming to derive domain-dependent latent representations from overdispersed count data through a hierarchical negative binomial factorization method, so as to accurately classify cancer subtypes even when the number of samples for a specific cancer type is small.

The paper mainly focuses on the following challenges:

1. **Overdispersion of data**: NGS count data are usually overdispersed and require appropriate modeling methods to handle this.
2. **Limited sample size**: Compared with the number of molecules involved and the complexity of the system, the number of samples available for studying complex diseases (such as cancer) is often very limited, especially when considering disease heterogeneity.
3. **Cross-domain data integration**: The key question is whether all available data from different sources or domains can be integrated to achieve reproducible disease prognosis based on NGS count data.

The BMDL model solves these problems in the following ways:

- **Hierarchical negative binomial factorization**: Use the hierarchical negative binomial distribution to model overdispersed count data and extract domain-dependent latent representations.
- **Domain-specific and globally shared latent factors**: The model allows the extraction of domain-specific latent factors as well as globally shared latent factors, thereby improving the accuracy of cancer subtype classification when the sample size is limited.
- **Avoiding negative transfer**: By automatically learning the correlations of samples in different domains, the BMDL model can avoid the "negative transfer" effect common in existing multi-task learning and transfer learning methods.

The experimental results show that the BMDL model performs well on both simulated data and TCGA's NGS data sets, can effectively learn among different domains without negative impacts, and significantly improves the accuracy of cancer subtype classification, especially when the number of samples in the target domain is small.
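
To make the overdispersion point concrete, the following is a minimal sketch, not the paper's model. It draws negative binomial counts as a gamma-Poisson mixture, which is one common parameterization, and compares their variance with Poisson counts of the same mean. The function name `sample_negative_binomial` and the values of `r` and `p` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def sample_negative_binomial(r, p, size, rng):
    """Draw NB(r, p) counts as a gamma-Poisson mixture (illustrative helper).

    rate  ~ Gamma(shape=r, scale=p / (1 - p))
    count ~ Poisson(rate)
    Under this parameterization the mean is r * p / (1 - p) and the
    variance is r * p / (1 - p) ** 2, so the variance exceeds the mean.
    """
    rates = rng.gamma(shape=r, scale=p / (1.0 - p), size=size)
    return rng.poisson(rates)


rng = np.random.default_rng(0)

# Hypothetical parameters chosen only for illustration.
r, p = 2.0, 0.8
expected_mean = r * p / (1.0 - p)        # 8.0
expected_var = r * p / (1.0 - p) ** 2    # 40.0

nb_counts = sample_negative_binomial(r, p, size=200_000, rng=rng)

# Poisson counts with the same mean have variance equal to that mean.
poisson_counts = rng.poisson(expected_mean, size=200_000)

print("NB      mean, variance:", nb_counts.mean(), nb_counts.var())
print("Poisson mean, variance:", poisson_counts.mean(), poisson_counts.var())
```

With these assumed parameters both samples share a mean of about 8, but the negative binomial variance is about five times larger. A Poisson model cannot represent that extra spread, which is why a negative binomial likelihood is a natural choice for count data of this kind.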

Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data

Optimal Bayesian Transfer Learning for Count Data

Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach

Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data

Deep Subspace Mutual Learning for Cancer Subtypes Prediction

Bayesian network-driven clustering analysis with feature selection for high-dimensional multi-modal molecular data

Cancer Subtyping via Embedded Unsupervised Learning on Transcriptomics Data

Deep multi-view contrastive learning for cancer subtype identification

Classification of tumor from computed tomography images: A brain-inspired multisource transfer learning under probability distribution adaptation

Multi-View Spectral Clustering with Latent Representation Learning for Applications on Multi-Omics Cancer Subtyping

Outcome-guided Bayesian clustering for disease subtype discovery using high-dimensional transcriptomic data

Optimal Bayesian supervised domain adaptation for RNA sequencing data

Molecular Subtyping of Cancer Based on Distinguishing Co-Expression Modules and Machine Learning

Network-based Multi-Task Learning Models for Biomarker Selection and Cancer Outcome Prediction

BC-Predict: Mining of signal biomarkers and multilevel validation of cascade classifier for early-stage breast cancer subtyping and prognosis

A Contrastive-Learning-Based Deep Neural Network for Cancer Subtyping by Integrating Multi-Omics Data

Exploiting common patterns in diverse cancer types via multi-task learning

Multi-view contrastive clustering for cancer subtyping using fully and weakly paired multi-omics data

A Hybrid Deep Learning Model for Predicting Molecular Subtypes of Human Breast Cancer Using Multimodal Data

Bayesian Structure Learning in Multi-layered Genomic Networks