Abstract: Finding where transcription factors (TFs) bind to DNA is of key importance for deciphering gene regulation at the transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs), which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction, and they do not account for flexible-length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible and can model both position interdependence within TFBSs and variable-length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence.
These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
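To make the independence assumption of the classical PWM baseline concrete, the following is a minimal Python sketch of PWM construction and scanning. It is not the authors' implementation and it does not implement a TFFM. The function names (`build_pwm`, `score_window`, `best_hit`), the pseudocount, the uniform background, and the toy sites are all assumptions made for illustration only.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Assumptions (illustrative): a uniform background frequency and a
    fixed pseudocount added to every base count in every column.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Each column is estimated on its own: this is the position
        # independence assumption described in the abstract.
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score_window(pwm, window):
    """Sum per-position log-odds; no term couples neighbouring bases."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Scan every window of PWM width and return (best score, offset)."""
    width = len(pwm)
    hits = [
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)

# Hypothetical toy sites, made up for this sketch (not data from the paper).
toy_sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG"]
toy_pwm = build_pwm(toy_sites)
top_score, top_offset = best_hit(toy_pwm, "TTGACACGTGAAT")
```

Because the score is a plain sum over columns, such a model cannot express that the preferred base at one position depends on its neighbour, and its fixed width cannot accommodate variable-length motifs. These are the two limitations that the HMM-based TFFMs are introduced to address.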

Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites

Using sequence-specific chemical and structural properties of DNA to predict transcription factor binding sites

An efficient algorithm for improving structure-based prediction of transcription factor binding sites

Predicting Tissue Specific Transcription Factor Binding Sites

Predicting the DNA binding specificity of mutated transcription factors using family-level biophysically interpretable machine learning

The Next Generation of Transcription Factor Binding Site Prediction

Advances on Bioinformatic Research in Transcription Factor Binding Sites

Understanding Transcriptional Regulation by Integrative Analysis of Transcription Factor Binding Data

Multiomics-integrated Deep Language Model Enables in Silico Genome-Wide Detection of Transcription Factor Binding Site in Unexplored Biosamples

A Structural-Based Strategy for Recognition of Transcription Factor Binding Sites

Predicting transcription factor binding motifs from DNA-binding domains, chromatin accessibility and gene expression data

Modelling the transcription factor DNA-binding affinity using genome-wide ChIP-based data

Prediction of Transcription Factor Binding Sites on Cell-Free DNA Based on Deep Learning

MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites

Stability selection for regression-based models of transcription factor-DNA binding specificity

Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention

Prediction of Transcription Factor Binding Sites Using a Combined Deep Learning Approach

Predicting Transcription Factor Specificity with All-Atom Models

An Integrative Analysis of TFBS-clustered Regions Reveals New Transcriptional Regulation Models on the Accessible Chromatin Landscape

DeepSTF: predicting transcription factor binding sites by interpretable deep neural networks combining sequence and shape

Estimating binding properties of transcription factors from genome-wide binding profiles