Predicting DNA structure using a deep learning method

Jinsen Li, Tsu-Pei Chiu, Remo Rohs
DOI: https://doi.org/10.1038/s41467-024-45191-5
IF: 16.6
2024-02-09
Nature Communications
Abstract: Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA structure, also described as DNA shape, plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k-mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, DNA structural features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing an understanding of the effects of flanking regions on DNA structure in a target region of a sequence. The Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as a versatile and powerful tool for diverse DNA structure-related studies.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper addresses how three-dimensional DNA structure (DNA shape) influences protein-DNA binding. It develops a deep-learning-based method, Deep DNAshape, that predicts DNA structural features in a high-throughput manner. The method improves on current k-mer-based high-throughput prediction by accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. The specific problems the paper attempts to solve are:

1. **Improve the accuracy of DNA shape prediction**: Existing methods such as DNAshape rely on pentamer lookup tables, which capture only the influence of the nearest and next-nearest neighbors and ignore the more distant sequence environment. Deep DNAshape overcomes this limitation with a deep-learning model and predicts DNA shape features more accurately, especially where extended flanking regions matter.
2. **Understand the influence of flanking regions on the core DNA structure**: The paper uses high-throughput prediction of DNA shape features to explore how flanking regions affect the structure of a core region. This gives a deeper understanding of the detailed structural readout mechanisms of protein-DNA binding, especially for transcription factors (TFs) with long core motifs.
3. **Improve the prediction accuracy of machine-learning models**: When the features generated by Deep DNAshape are incorporated into machine-learning models, the accuracy of those models in predicting TF-DNA binding specificity improves. This provides new tools and methods for studying gene regulation mechanisms.
4. **Predict DNA shape fluctuations**: In addition to static DNA shape features, Deep DNAshape can predict DNA shape fluctuations, which helps in understanding the conformational flexibility of DNA molecules and its influence on protein binding.

In summary, the main objective of this paper is to provide an efficient and accurate tool, Deep DNAshape, for predicting DNA shape features and their fluctuations, in order to better understand protein-DNA binding mechanisms and improve the predictive ability of related machine-learning models.
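
The pentamer-lookup limitation described in item 1 can be illustrated with a small Python sketch. This is not the DNAshape or Deep DNAshape implementation: the table values are made-up placeholders, and the names `toy_pentamer_table` and `pentamer_lookup` are hypothetical. It only shows that a sliding pentamer lookup assigns the same value to a position whenever the surrounding five bases match, regardless of what lies further out.

```python
from itertools import product


def toy_pentamer_table():
    """Build a HYPOTHETICAL lookup table with one placeholder value per pentamer.

    The numbers are arbitrary and carry no physical meaning; they stand in
    for a precomputed shape feature.
    """
    table = {}
    for i, kmer in enumerate("".join(p) for p in product("ACGT", repeat=5)):
        table[kmer] = 4.0 + (i % 17) * 0.1  # placeholder, not real data
    return table


def pentamer_lookup(seq, table):
    """Return one value for each position that has a full pentamer centred on it.

    Position i is described only by seq[i-2 : i+3], i.e. its nearest and
    next-nearest neighbours on each side.
    """
    return [table[seq[i - 2:i + 3]] for i in range(2, len(seq) - 2)]


table = toy_pentamer_table()

# Same central pentamer "CGTGC", different flanks three or more bases away.
seq_a = "AAACGTGCAAA"
seq_b = "TTTCGTGCTTT"

values_a = pentamer_lookup(seq_a, table)
values_b = pentamer_lookup(seq_b, table)

centre = len(values_a) // 2  # list index of the central base (sequence index 5)
print(values_a[centre] == values_b[centre])  # the lookup cannot tell the flanks apart
```

In this sketch the central position receives an identical value in both sequences because the lookup never sees bases beyond the pentamer window. A model that takes a longer sequence context as input, as the summary describes for Deep DNAshape, is not bound by that window.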