Abstract: Efficient and accurate recognition of protein–DNA interactions is vital for understanding the molecular mechanisms of related biological processes and for guiding drug discovery. Although current experimental protocols are the most precise way to determine protein–DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need for efficient computational approaches to predicting DNA-binding sites. Here, we propose ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. The model uses an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To evaluate its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes the evolution diversity-based embeddings into DNA-binding results at the residue level. Our findings demonstrate a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.

LangMoDHS: A deep learning language model for predicting DNase I hypersensitive sites in mouse genome

iDHS-Deep: an integrated tool for predicting DNase I hypersensitive sites by deep neural network

iDHS-FFLG: Identifying DNase I Hypersensitive Sites by Feature Fusion and Local–Global Feature Extraction Network

iDHS-DSAMS: Identifying DNase I Hypersensitive Sites Based on the Dinucleotide Property Matrix and Ensemble Bagged Tree

The prediction of human DNase I hypersensitive sites based on DNA sequence information

iDHS-DT: Identifying DNase I Hypersensitive Sites by Integrating DNA Dinucleotide and Trinucleotide Information

Identification of DNase I Hypersensitive Sites in the Human Genome by Multiple Sequence Descriptors

Recognition of DNase I hypersensitive sites in multiple cell lines

DHSpred: support-vector-machine-based human DNase I hypersensitive sites prediction using the optimal features selected by random forest

DeepSite: bidirectional LSTM and CNN models for predicting DNA–protein binding

Deep Learning Based Method for Predicting DNA N6-methyladenosine Sites

Predicting DNA Methylation States with Hybrid Information Based Deep-Learning Model

MLSNet: a deep learning model for predicting transcription factor binding sites

Deep6mAPred: A CNN and Bi-LSTM-based deep learning method for predicting DNA N6-methyladenosine sites across plant species

Predicting the sequence specificities of DNA-binding proteins by DNA Fine-tuned Language Model with decaying learning rates

Predicting DNase I Hypersensitive Sites Via Un-Biased Pseudo Trinucleotide Composition

CEPZ: A Novel Predictor for Identification of DNase I Hypersensitive Sites

iDHS-RGME: Identification of DNase I hypersensitive sites by integrating information on nucleotide composition and physicochemical properties

Predicting Functional Elements and Variants Effects in Non-Coding Regions Based on Deep Learning

DeepTF: Accurate Prediction of Transcription Factor Binding Sites by Combining Multi-Scale Convolution and Long Short-Term Memory Neural Network

ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein–DNA binding site prediction