Abstract: Genomic micro-satellites are genomic regions consisting of short, repetitive DNA motifs. A micro-satellite region usually exhibits intrinsic polymorphism in the number of repeated motifs, which is often described as a probability distribution over the number of repeats. In cancer genomics, a micro-satellite region is considered to carry a micro-satellite instability (MSI) event if the distribution sampled from tumor tissue differs significantly from the distribution sampled from the corresponding normal tissue. Because recent studies have emphasized the importance of MSI events in cancer diagnosis and treatment, a series of computational approaches have been developed to detect MSI events from sequencing data. However, existing methods lose accuracy when clonal micro-satellites are present, as recently observed in some TCGA/ICGC samples. For a clonal micro-satellite, different sub-clones may carry different distributions, and the “distribution” observed in the sequencing data is actually a convolution of the sub-clonal ones. In this case, a sub-clonal distribution may represent a true MSI event, but the convolved distribution dilutes the signal and misleads the detection algorithm into reporting micro-satellite stability (MSS). This is a missed detection, i.e., a type-II error when stability is taken as the null hypothesis. In addition, a comprehensive understanding of the micro-satellite distribution of each sub-clone is informative for downstream analyses. To overcome this weakness of existing approaches and improve the computational model, we propose a probabilistic framework, named CMSI, that identifies MSI events in heterogeneous tumors. Like other approaches, CMSI works on next-generation sequencing (NGS) data.
The proposed framework assumes that the number of repeats in a micro-satellite region approximately follows a normal distribution. When clonal micro-satellites exist, the convolved distribution observed in the sequencing data should therefore follow a Gaussian mixture distribution. CMSI fits a variational Bayesian Gaussian mixture model to the repeat numbers calculated from the sequencing reads. This mixture model clusters the reads by the number of repeats they carry or imply, and provides a probabilistic assignment for each read by maximizing the global posterior distribution. Solving this model with an EM algorithm, CMSI estimates the number of sub-clones, the proportion of each sub-clone, and the parameters of each distribution. Finally, each sub-clonal distribution is examined by a statistical test weighted by its clonal proportion, and CMSI outputs the MSI events of the sub-clones. To verify the performance of the proposed framework, we conducted several experiments on both simulated and real datasets, in which CMSI identified an acceptable percentage of the preset MSI events.

Note: This abstract was not presented at the meeting.

Citation Format: Yixuan Wang, Xuanping Zhang, Yi Huang, Tao Liu, Xiao Xiao, Jiayin Wang. CMSI: A Bayesian model for estimating clonal micro-satellites instability from NGS data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr LB-215.
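
The mixture-model step described in the abstract can be illustrated with a minimal sketch. This is not the CMSI implementation: it swaps the variational Bayesian model for a plain maximum-likelihood EM fit of a one-dimensional Gaussian mixture and picks the number of components by BIC. The simulated sample (two sub-clones at 70% and 30%, centred on 15 and 21 repeats), the function names, and all parameter values are assumptions made purely for illustration. The per-sub-clone statistical test against normal tissue is not covered.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k, n_iter=200, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances, log_likelihood).
    """
    srt = sorted(data)
    n = len(srt)
    # Initialise the means at evenly spaced quantiles of the data.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(srt) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in srt) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    def mixture_terms(x):
        return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(n_iter):
        # E-step: posterior probability that each read belongs to each component.
        resp = []
        for x in data:
            dens = mixture_terms(x)
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate proportion, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)  # floor avoids collapsing components

    loglik = sum(math.log(sum(mixture_terms(x)) or 1e-300) for x in data)
    return weights, means, variances, loglik


def bic(loglik, k, n):
    """Bayesian information criterion; a 1-D k-component mixture has 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


def simulate_reads(seed=0):
    """Hypothetical tumor sample: 420 reads from one sub-clone, 180 from another."""
    rng = random.Random(seed)
    reads = [rng.gauss(15.0, 1.0) for _ in range(420)]   # assumed sub-clone A
    reads += [rng.gauss(21.0, 1.0) for _ in range(180)]  # assumed sub-clone B
    return reads


reads = simulate_reads()
scores = {}
for k in (1, 2, 3):
    weights, means, variances, loglik = fit_gmm_1d(reads, k)
    scores[k] = bic(loglik, k, len(reads))
    print(k, round(scores[k], 1),
          [round(m, 2) for m in means], [round(w, 2) for w in weights])
print("k with lowest BIC:", min(scores, key=scores.get))
```

Comparing the one-component fit with the two-component fit shows the dilution effect the abstract describes: a single Gaussian averages over both sub-clones, whereas the mixture recovers a separate mean and proportion for each. A variational Bayesian treatment, as used by CMSI, would instead place priors on the mixture parameters and let superfluous components shrink toward zero weight, rather than relying on an explicit BIC comparison.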
