Abstract: Accurate prediction of DNA-protein binding (DPB) is of great biological significance for studying the regulatory mechanisms of gene expression. In recent years, with the rapid development of deep learning, advanced deep neural networks have been introduced into the field and shown to significantly improve DPB prediction performance. However, these methods rely primarily on DNA sequences measured by ChIP-seq technology and fail to account for possible partial variations in the motif sequences and for errors of the sequencing technology itself. To address this, we propose a novel computational method, termed MSDenseNet, which combines a new fault-tolerant coding (FTC) scheme with a densely connected deep neural network. The success of MSDenseNet can be attributed to three factors. First, MSDenseNet uses a powerful feature representation that transforms the raw DNA sequence into a fusion coding using the fault-tolerant feature sequence. Second, in terms of network structure, MSDenseNet uses multi-scale convolution both within the dense layer and preceding the dense block, which is shown to significantly improve network performance and accelerate convergence. Third, building upon this advanced deep neural network, MSDenseNet can effectively mine the hidden complex relationships among the internal attributes of the fusion sequence features to enhance DPB prediction. Benchmarking experiments on 690 ChIP-seq datasets show that MSDenseNet achieves an average AUC of 0.933 and outperforms the state-of-the-art method, indicating that it can effectively predict DPB. The source code of MSDenseNet is available at https://github.com/csbio-njust-edu/msdensenet.
We anticipate that MSDenseNet will serve as a powerful tool to facilitate a more thorough understanding of DNA-binding proteins and to aid their functional characterization.
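
To make the idea of multi-scale convolution with dense connectivity more concrete, the following is a minimal pure-Python sketch and not the authors' implementation. It rests on several assumptions: a plain one-hot encoding stands in for the FTC scheme (whose details the abstract does not give), the kernel widths 3, 5 and 7 and the uniform filter weights are hypothetical values chosen only for illustration, and the names `one_hot`, `conv1d_same` and `multi_scale_dense_layer` are invented for this example. It shows one layer that applies filters of several widths to the same input and concatenates the resulting feature maps onto the input channels.

```python
from typing import List

ALPHABET = "ACGT"

def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as one 4-element row (A, C, G, T) per base.
    Unknown bases such as N become an all-zero row.
    Assumption: this is plain one-hot coding, not the paper's FTC scheme."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]

def conv1d_same(x: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Single-filter 1D convolution with zero padding and ReLU.
    x has L rows of C channels; kernel has K rows of C weights (K odd).
    The output has one value per input position."""
    length, width = len(x), len(kernel)
    half = width // 2
    out = []
    for pos in range(length):
        total = 0.0
        for k in range(width):
            src = pos + k - half
            if 0 <= src < length:  # positions outside the sequence count as zero
                total += sum(w * v for w, v in zip(kernel[k], x[src]))
        out.append(max(total, 0.0))  # ReLU activation
    return out

def multi_scale_dense_layer(x: List[List[float]],
                            kernels: List[List[List[float]]]) -> List[List[float]]:
    """Apply one filter per kernel width to the same input, then append the
    new feature maps to the input channels (dense connectivity)."""
    feature_maps = [conv1d_same(x, kern) for kern in kernels]
    return [row + [fm[pos] for fm in feature_maps] for pos, row in enumerate(x)]

# Hypothetical filters of widths 3, 5 and 7 with uniform weights,
# chosen only so that the example is deterministic.
kernels = [[[1.0 / width] * 4 for _ in range(width)] for width in (3, 5, 7)]

x = one_hot("ACGTNACGT")
y = multi_scale_dense_layer(x, kernels)
print(len(y), len(y[0]))  # sequence length, and 4 input + 3 new channels
```

Because the input channels are carried forward unchanged, stacking several such layers lets later layers see both the raw encoding and every earlier feature map. A trained network would learn the filter weights and use many filters per width rather than one.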
