Abstract: Genes in living cells are often regulated by proteins called transcription factors, which bind directly to short segments of DNA in close proximity to specific genes. These binding sites share a conserved nucleotide pattern, called a motif. Several recent studies of transcriptional regulation require reducing a large collection of motifs into clusters based on the similarity of their nucleotide composition. We present a principled approach to this clustering problem based on a Bayesian hierarchical model that accounts for both within- and between-motif variability. We use a Dirichlet process prior distribution that allows the number of clusters to vary, and we also present a novel generalization that allows the core width of each motif to vary. The clustering model is implemented with a Gibbs sampling strategy and applied to several collections of transcription factor motif matrices. Our stochastic implementation allows us to examine the variability of our results in addition to focusing on a set of best clusters. Our results suggest that several transcription factor protein families are actually mixtures of smaller groups of highly similar motifs, which provide substantially more refined information than the full set of motifs in the family. Our clusters provide a means of organizing transcription factors by binding motif similarity and can be used to reduce motif redundancy within large databases such as JASPAR and TRANSFAC, which aids the use of these databases for further motif discovery. Finally, our clustering procedure has been used in combination with discovery of evolutionarily conserved motifs to predict co-regulated genes. We also present an alternative to our Dirichlet process prior distribution that differs substantially in its a priori clustering characteristics but shows no substantive difference in the clustering results for our dataset.
Although our application is specific to transcription factor binding motifs, our Bayesian clustering model based on the Dirichlet process has several advantages over traditional clustering methods that could make the procedure appropriate and useful for many other clustering applications.
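To illustrate how a Dirichlet process prior lets the number of clusters vary, the following minimal Python sketch uses the standard Chinese restaurant process representation: an item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. This covers the prior only. The function names and the parameter `alpha` are illustrative assumptions, and the motif likelihood, the variable core width, and the full Gibbs sampler described in the abstract are not reproduced here.

```python
import random


def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probabilities that the next item joins each existing cluster
    (in order) or opens a new cluster (last entry).

    cluster_sizes: sizes of the current clusters, excluding the item
                   being assigned.
    alpha: concentration parameter (hypothetical name); larger values
           favor opening new clusters.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = sum(cluster_sizes) + alpha
    probs = [size / total for size in cluster_sizes]
    probs.append(alpha / total)  # probability of a brand-new cluster
    return probs


def sample_crp_partition(n_items, alpha, rng):
    """Draw one partition of n_items from the prior by seating items
    one at a time; the number of clusters is not fixed in advance."""
    sizes = []
    labels = []
    for _ in range(n_items):
        probs = crp_assignment_probs(sizes, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_partition(20, alpha=1.0, rng=random.Random(0))
    print("cluster sizes:", sizes)
```

In a full sampler, each of these prior weights would be multiplied by the likelihood of the motif under the corresponding cluster before normalizing and sampling. That step depends on the model details in the paper and is left out of this sketch.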

Bayesian Clustering of Transcription Factor Binding Motifs
