Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data

Ehsan Hajiramezanali, Siamak Zamani Dadaneh, Alireza Karbalayghareh, Mingyuan Zhou, Xiaoning Qian
DOI: https://doi.org/10.48550/arXiv.1810.09433
2018-10-23
Abstract: Precision medicine aims for personalized prognosis and therapeutics by utilizing recent genome-scale high-throughput profiling techniques, including next-generation sequencing (NGS). However, translating NGS data faces several challenges. First, NGS count data are often overdispersed, requiring appropriate modeling. Second, compared to the number of involved molecules and system complexity, the number of available samples for studying complex disease, such as cancer, is often limited, especially considering disease heterogeneity. The key question is whether we may integrate available data from all different sources or domains to achieve reproducible disease prognosis based on NGS count data. In this paper, we develop a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping even if the number of samples for a specific cancer type is small. Experimental results from both our simulated and NGS datasets from The Cancer Genome Atlas (TCGA) demonstrate the promising potential of BMDL for effective multi-domain learning without "negative transfer" effects often seen in existing multi-task learning and transfer learning methods.
Machine Learning, Genomics, Applications
What problem does this paper attempt to address?
This paper asks how data from different sources or domains can be used effectively for cancer subtype discovery, particularly when few samples are available for a specific cancer type, so that disease prognosis based on next-generation sequencing (NGS) count data becomes more accurate. It proposes a Bayesian Multi-Domain Learning (BMDL) model that derives domain-dependent latent representations from overdispersed count data through hierarchical negative binomial factorization, so that cancer subtypes can be identified accurately even when the number of samples for a specific cancer type is small.

The paper focuses on the following challenges:

1. **Overdispersion**: NGS count data are often overdispersed and require appropriate modeling.
2. **Limited sample size**: Compared with the number of molecules involved and the complexity of the system, the number of samples available for studying complex diseases such as cancer is often very limited, especially when disease heterogeneity is taken into account.
3. **Cross-domain data integration**: The key question is whether all available data from different sources or domains can be integrated to achieve reproducible disease prognosis based on NGS count data.

The BMDL model addresses these challenges in the following ways:

- **Hierarchical negative binomial factorization**: The model uses a hierarchical negative binomial factorization to model overdispersed count data and extract domain-dependent latent representations.
- **Domain-specific and globally shared latent factors**: The model extracts domain-specific latent factors as well as globally shared latent factors, which improves the accuracy of cancer subtype classification when the sample size is limited.
- **Avoiding negative transfer**: By automatically learning the correlations among samples in different domains, the model avoids the "negative transfer" effect often seen in existing multi-task learning and transfer learning methods.

Experimental results on both simulated data and NGS datasets from TCGA indicate that BMDL learns effectively across domains without negative transfer and improves the accuracy of cancer subtype classification, especially when the number of samples in the target domain is small.
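
To make the overdispersion and factorization ideas above concrete, here is a minimal NumPy sketch. It is not the BMDL model or its inference procedure: it only simulates counts from a gamma-Poisson mixture (one standard way to obtain negative binomial counts) and from a toy factorized mean with shared and domain-specific factors. The function names, the dispersion `r`, the matrix sizes, and the binary factor masks are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_gamma_poisson(mean, r, rng):
    """Draw negative binomial counts as a gamma-Poisson mixture.

    rate ~ Gamma(shape=r, scale=mean / r), count ~ Poisson(rate), so that
    E[count] = mean and Var[count] = mean + mean**2 / r (overdispersed).
    """
    rates = rng.gamma(shape=r, scale=np.asarray(mean) / r)
    return rng.poisson(rates)


rng = np.random.default_rng(0)

# Part 1: compare variance-to-mean ratios at the same mean.
mean, r, n = 50.0, 2.0, 200_000          # illustrative values (assumptions)
poisson_counts = rng.poisson(mean, size=n)
nb_counts = sample_gamma_poisson(np.full(n, mean), r, rng)

poisson_ratio = poisson_counts.var() / poisson_counts.mean()  # close to 1
nb_ratio = nb_counts.var() / nb_counts.mean()                 # close to 1 + mean / r
expected_nb_ratio = 1.0 + mean / r

# Part 2: toy factorized mean with shared and domain-specific factors.
n_genes, n_factors = 30, 4
phi = rng.gamma(shape=1.0, scale=1.0, size=(n_genes, n_factors))  # shared loadings

# Hypothetical masks: factors 0 and 1 are shared, 2 and 3 are domain-specific.
domain_masks = {
    "domain_a": np.array([1.0, 1.0, 1.0, 0.0]),
    "domain_b": np.array([1.0, 1.0, 0.0, 1.0]),
}
domain_sizes = {"domain_a": 40, "domain_b": 8}  # domain_b is the small domain

domain_counts = {}
for name, mask in domain_masks.items():
    # Per-sample factor scores; masked-out factors contribute nothing.
    theta = mask[:, None] * rng.gamma(1.0, 1.0, size=(n_factors, domain_sizes[name]))
    domain_counts[name] = sample_gamma_poisson(phi @ theta, r, rng)

print(f"Poisson variance/mean: {poisson_ratio:.2f}")
print(f"Gamma-Poisson variance/mean: {nb_ratio:.2f} (theory: {expected_nb_ratio:.1f})")
print({name: counts.shape for name, counts in domain_counts.items()})
```

In this sketch the Poisson counts have a variance-to-mean ratio near 1, while the gamma-Poisson counts have a ratio near `1 + mean / r`, which is why a Poisson model tends to understate the variability of such data. The second part shows how two domains can share some factors through a common loading matrix while each keeps a factor of its own.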