Abstract: Determining transcriptional factor binding sites (TFBSs) is critical for understanding the molecular mechanisms regulating gene expression in different biological conditions. Biological assays designed to directly map TFBSs require large sample sizes and intensive resources. As an alternative, the ATAC-seq assay is simple to conduct and provides genomic cleavage profiles that contain rich information for imputing TFBSs indirectly. Previous footprint-based tools are inherently limited by the accuracy of their bias correction algorithms and the efficiency of their feature extraction models. Here we introduce TAMC (Transcriptional factor binding prediction from ATAC-seq profile at Motif-predicted binding sites using Convolutional neural networks), a deep-learning approach for predicting motif-centric TF binding activity from paired-end ATAC-seq data. TAMC does not require bias correction during signal processing. By leveraging a one-dimensional convolutional neural network (1D-CNN) model, TAMC makes predictions based on both footprint and non-footprint features at binding sites for each TF and outperforms existing footprinting tools in TFBS prediction, particularly for ATAC-seq data with limited sequencing depth.

Applications of deep learning models are rapidly gaining popularity in recent biological studies because of their efficiency in analyzing non-linear patterns in feature-rich data. In this study, we developed a deep learning method to predict transcription factor binding sites based on chromatin accessibility profiles. Compared to previous methods using scoring functions and classical machine learning algorithms, our method forgoes the need for bias correction during signal processing and significantly increases the efficiency of feature extraction at transcription factor binding sites. In addition, we showed that our method outperforms previous methods, particularly for chromatin accessibility data with shallow sequencing depth.
In this study, we applied our method to predict changes in the binding sites of a transcription factor, CTCF, during early embryonic development based on bulk chromatin accessibility profiles. We then discussed the potential application of our method to transcription factor binding site prediction using single-cell chromatin accessibility profiles, as well as possible strategies to further improve its performance in the future.
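
To make the motif-centric idea concrete, the following is a minimal sketch of how a cleavage profile around a motif-predicted site could be scored with a one-dimensional convolution. It is not TAMC's actual architecture or preprocessing: the function names (`cut_site_profile`, `conv1d_relu`, `predict_bound`), the 50 bp flank, the kernel sizes, and the random untrained weights are all assumptions chosen for illustration, and fragment coordinates are assumed to already mark cut positions.

```python
import math
import random

def cut_site_profile(fragments, center, flank=50):
    """Count fragment ends (cut sites) per base in [center - flank, center + flank]."""
    width = 2 * flank + 1
    start = center - flank
    profile = [0.0] * width
    for frag_start, frag_end in fragments:
        # Both ends of a paired-end fragment are treated as cleavage events.
        for pos in (frag_start, frag_end):
            idx = pos - start
            if 0 <= idx < width:
                profile[idx] += 1.0
    return profile

def conv1d_relu(signal, kernels, biases):
    """Valid 1D convolution (cross-correlation) followed by ReLU, one map per kernel."""
    feature_maps = []
    for kernel, bias in zip(kernels, biases):
        k = len(kernel)
        channel = []
        for i in range(len(signal) - k + 1):
            s = bias + sum(signal[i + j] * kernel[j] for j in range(k))
            channel.append(max(0.0, s))
        feature_maps.append(channel)
    return feature_maps

def predict_bound(profile, kernels, biases, dense_w, dense_b):
    """Conv -> global max pooling -> logistic output: a toy bound/unbound score."""
    pooled = [max(ch) for ch in conv1d_relu(profile, kernels, biases)]
    logit = dense_b + sum(w * p for w, p in zip(dense_w, pooled))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical fragments (start, end) near a motif centered at position 1000.
fragments = [(960, 1010), (990, 1100), (1040, 1200)]
profile = cut_site_profile(fragments, center=1000, flank=50)

# Untrained, randomly initialized weights; a real model would learn these
# from labeled bound and unbound sites.
rng = random.Random(0)
kernels = [[rng.uniform(-0.1, 0.1) for _ in range(9)] for _ in range(4)]
biases = [0.0] * 4
dense_w = [rng.uniform(-0.1, 0.1) for _ in range(4)]
score = predict_bound(profile, kernels, biases, dense_w, dense_b=0.0)
```

Because the convolution runs over the raw cut-site counts, the same filters can in principle respond to footprint-like dips as well as to broader accessibility patterns in the flanks, which is the intuition behind using both footprint and non-footprint features.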

OCRFinder: a Noise-Tolerance Machine Learning Method for Accurately Estimating Open Chromatin Regions

Scart: Recognizing Cell Clusters and Constructing Trajectory from Single-Cell Epigenomic Data

DeepOCR: A multi-species deep-learning framework for accurate identification of open chromatin regions in livestock

Chromatin Accessibility Prediction Via a Hybrid Deep Convolutional Neural Network

Identifying OCRs in cfDNA WGS Data by Correlation Clustering

Detecting novel cell type in single-cell chromatin accessibility data via open-set domain adaptation

Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction

TRAFICA: An Open Chromatin Language Model to Improve Transcription Factor Binding Affinity Prediction

csORF-finder: an effective ensemble learning framework for accurate identification of multi-species coding short open reading frames

DeepCAGE: Incorporating Transcription Factors in Genome-wide Prediction of Chromatin Accessibility

Characterization of chromatin accessibility patterns in different mouse cell types using machine learning methods at single-cell resolution

scEpiLock: A Weakly Supervised Learning Framework for cis-Regulatory Element Localization and Variant Impact Quantification for Single-Cell Epigenetic Data

SilenceREIN: seeking silencers on anchors of chromatin loops by deep graph neural networks

RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State

Enhancement and Imputation of Peak Signal Enables Accurate Cell-Type Classification in scATAC-seq

Comparison of differential accessibility analysis strategies for ATAC-seq data

Predicted constrained accessible regions mark regulatory elements and causal variants

ROCCO: a robust method for detection of open chromatin via convex optimization

The full set of potential open regions (PORs) in the human genome defined by consensus peaks of ATAC-seq data

Global prediction of chromatin accessibility using small-cell-number and single-cell RNA-seq

TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile