Discovering DNA shape motifs with multiple DNA shape features: generalization, methods, and validation

Nanjun Chen, Jixiang Yu, Zhe Liu, Lingkuan Meng, Xiangtao Li, Ka-Chun Wong
DOI: https://doi.org/10.1093/nar/gkae210
IF: 14.9
2024-04-04
Nucleic Acids Research
Abstract: DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors, can bind to specific DNA motifs to regulate gene expression and other cellular activities. Past studies suggest that DNA shape features could be subtly involved in DNA–DBP interactions. Therefore, the shape motif annotations based on intrinsic DNA topology can deepen the understanding of DNA–DBP binding. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features altogether remain insufficient. To address it, we propose a series of methods to discover non-redundant DNA shape motifs with the generalization to multiple motifs in multiple shape features. Specifically, an existing Gibbs sampling method is generalized to multiple DNA motif discovery with multiple shape features. Meanwhile, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are proposed and developed with promising performance, convergence capability, and efficiency. The discovered DNA shape motif instances reveal insights into low-signal ChIP-seq peak summits, complementing the existing sequence motif discovery works. Additionally, our modelling captures the potential interplays across multiple DNA shape features. We provide a valuable platform of tools for DNA shape motif discovery. An R package is built for open accessibility and long-lasting impact: https://zenodo.org/doi/10.5281/zenodo.10558980.
biochemistry & molecular biology
What problem does this paper attempt to address?
The paper addresses a gap in tooling: existing high-throughput tools for DNA shape motif discovery do not adequately combine multiple DNA shape features. Previous studies suggest that DNA shape features could be subtly involved in DNA-protein interactions, and that shape motif annotations can deepen the understanding of this binding. However, most current methods handle only a single shape feature at a time and cannot model several features simultaneously.

To close this gap, the authors propose a series of methods for discovering non-redundant DNA shape motifs across multiple shape features:

1. **Generalized Gibbs sampling method**: an existing Gibbs sampling method is extended to handle multiple DNA shape motifs and multiple shape features.
2. **Expectation-maximization (EM) method**: a new EM method is developed for shape motif discovery.
3. **Hybrid method (EM-Gibbs)**: EM is coupled with Gibbs sampling; together with the EM method, it is reported to show promising performance, convergence capability, and efficiency.

The authors also provide a platform of tools for DNA shape motif discovery and release it as an openly accessible R package. With these methods, they aim to reveal insights into low-signal ChIP-seq peak summits, complement existing sequence motif discovery work, and capture potential interplay across multiple DNA shape features.
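The summary names an EM method for shape motifs but gives no algorithmic detail. The sketch below is a rough illustration of the general idea only. It is not the authors' algorithm and not the API of their R package. It assumes a simplified model that the source does not state: exactly one motif occurrence per sequence, independent Gaussian shape values with a fixed, shared standard deviation `sigma`, a single background mean per feature, and a greedy initialization from the most unusual window of the first sequence. The names `em_shape_motif` and `toy_data` are hypothetical, and the toy input is synthetic data with two made-up feature tracks.

```python
import math


def em_shape_motif(seqs, width, n_iter=20, sigma=1.0):
    """Toy EM for one shape motif (illustrative assumptions, see lead-in).

    seqs: list of sequences; each sequence is a list of per-position tuples
          holding one value per shape feature.
    Returns (means, posteriors): means[j][f] is the motif mean at offset j for
    feature f; posteriors[i][s] is P(motif starts at s | sequence i).
    """
    n_feat = len(seqs[0][0])
    # Background model: one mean per feature, pooled over all positions.
    flat = [pos for seq in seqs for pos in seq]
    bg = [sum(p[f] for p in flat) / len(flat) for f in range(n_feat)]

    # Heuristic initialization (an assumption): the window of the first
    # sequence that deviates most from the background.
    def deviation(s):
        return sum((seqs[0][s + j][f] - bg[f]) ** 2
                   for j in range(width) for f in range(n_feat))

    s0 = max(range(len(seqs[0]) - width + 1), key=deviation)
    means = [list(seqs[0][s0 + j]) for j in range(width)]

    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior over start positions from the Gaussian
        # log-likelihood ratio (motif vs background) of each window.
        posteriors = []
        for seq in seqs:
            scores = []
            for s in range(len(seq) - width + 1):
                llr = 0.0
                for j in range(width):
                    for f in range(n_feat):
                        x = seq[s + j][f]
                        llr += ((x - bg[f]) ** 2
                                - (x - means[j][f]) ** 2) / (2 * sigma ** 2)
                scores.append(llr)
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(v - top) for v in scores]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: posterior-weighted mean per motif offset and feature.
        # Each posterior sums to 1, so the total weight is len(seqs).
        for j in range(width):
            for f in range(n_feat):
                means[j][f] = sum(
                    post[s] * seq[s + j][f]
                    for seq, post in zip(seqs, posteriors)
                    for s in range(len(post))
                ) / len(seqs)
    return means, posteriors


def toy_data():
    """Synthetic two-feature tracks with one planted motif per sequence."""
    motif = [(3.0, -3.0), (4.0, -4.0), (3.0, -3.0)]
    starts = [2, 9, 5, 14]
    seqs = []
    for i, start in enumerate(starts):
        seq = []
        for p in range(20):
            noise = 0.1 * ((i * 7 + p * 3) % 5 - 2)  # small deterministic noise
            seq.append((noise, -noise))
        seq[start:start + 3] = motif  # overwrite three positions with the motif
        seqs.append(seq)
    return seqs, starts


seqs, planted = toy_data()
means, post = em_shape_motif(seqs, width=3)
recovered = [max(range(len(p)), key=p.__getitem__) for p in post]
print("planted starts:  ", planted)
print("recovered starts:", recovered)
```

Treating every feature as one more term in the same window score is what lets a single motif span several shape features in this sketch. A Gibbs-style variant would sample a start position from each posterior instead of averaging over all of them, which is one plausible reading of why the two approaches can be coupled, though the source does not describe how the hybrid works.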