Abstract: This paper applies a deep convolutional/highway MLP framework to classify genomic sequences on the transcription factor binding site task. To make the model understandable, we propose an optimization-driven strategy to extract "motifs", or symbolic patterns which visualize the positive class learned by the network. We show that our system, Deep Motif (DeMo), extracts motifs that are similar to, and in some cases outperform, the current well-known motifs. In addition, we find that a deeper model consisting of multiple convolutional and highway layers can outperform the single convolutional and fully connected layer used in the previous state of the art.

What problem does this paper attempt to address?

The problem this paper addresses is how to accurately classify and interpret transcription factor binding sites (TFBSs) in DNA sequences using deep learning, in order to better understand biological processes and support biomedical research related to human health. Specifically, given a DNA sequence, the task is to determine whether it contains a binding site for a specific transcription factor (TF).

### Main Problem Background

1. **Importance of Transcription Factor Binding Sites**
   - Transcription factors (TFs) are regulatory proteins. They bind to specific binding sites (TFBSs) on DNA to regulate cellular mechanisms.
   - Accurately identifying these binding sites is crucial for understanding gene expression regulation and related diseases.
2. **Limitations of Existing Technologies**
   - **ChIP-seq Experiments**: Although they can locate binding sites, they cannot reveal the patterns shared by all positive binding sites, and they do not explain why TFs bind at these positions.
   - **Traditional Methods**: Methods based on subset frequency counting may not generalize to unseen examples, and algorithms based on string kernel functions are limited by computational complexity.
3. **Requirements**
   - There is a need for large-scale computational methods that not only classify binding sites accurately but also generate clear patterns (motifs) representing positive binding sites.

### Main Contributions of the Paper

1. **Proposing the Deep Motif (DeMo) Model**
   - Using a deep convolutional/highway multi-layer perceptron (highway MLP) network, it achieves higher TFBS classification accuracy than existing methods.
2. **Extracting Interpretable Visual Patterns (motifs)**
   - An optimization-driven strategy is proposed to extract "motifs", that is, symbolic patterns, from the model to visualize the positive class learned by the network.
   - The generated motifs are not only similar to known motifs but also, in some cases, superior to existing motifs.

### Technical Details

- **Deep Model Structure**
  - Input: raw nucleotide characters (one-hot encoded)
  - Multiple convolutional layers (each with 128 filters of length 5), some of which are followed by max-pooling of length 2
  - After max-pooling, a fully connected highway network (highway MLP)
  - Output: a softmax function for binary classification
- **Motif Generation**
  - The input sequence is optimized to maximize the probability that it is a TFBS, using the following formula: \[ \arg\max_S P^+(S)+\lambda\|S\|_2^2 \]
  - Here \(P^+(S)\) is the probability that the input sequence \(S\) is a positive TFBS, and \(\lambda\) is a regularization parameter.

Through these methods, the paper not only improves classification accuracy but also provides an intuitive view of the patterns in positive binding sites, which is helpful for biomedical research.
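The motif-generation step can be illustrated with a minimal sketch of gradient ascent on the input. This is not the DeMo implementation: the trained network is replaced by a hypothetical toy scorer (a sigmoid over a linear score with weights `W`), the pattern `PLANTED` is an arbitrary string chosen for illustration, and all function names are assumptions. The objective keeps the sign of the formula as written above, and the example passes a negative `lam` so that the norm term acts as a penalty, which is also an assumption.

```python
import math

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as one 4-element row per position."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def positive_prob(S, W, b):
    """Toy stand-in for P+(S): sigmoid of a linear score (NOT the DeMo network)."""
    z = b + sum(w * s for w_row, s_row in zip(W, S) for w, s in zip(w_row, s_row))
    return 1.0 / (1.0 + math.exp(-z))

def objective(S, W, b, lam):
    """P+(S) + lam * ||S||_2^2, with the sign as written in the formula."""
    sq_norm = sum(s * s for row in S for s in row)
    return positive_prob(S, W, b) + lam * sq_norm

def optimize_input(W, b, lam, steps=200, lr=0.5):
    """Gradient ascent on the input matrix S, starting from a uniform matrix."""
    S = [[0.25] * 4 for _ in W]
    for _ in range(steps):
        p = positive_prob(S, W, b)
        slope = p * (1.0 - p)  # derivative of the sigmoid w.r.t. its score
        for s_row, w_row in zip(S, W):
            for j in range(4):
                # d/dS of P+(S) is slope * W; d/dS of lam * ||S||^2 is 2 * lam * S
                s_row[j] += lr * (slope * w_row[j] + 2.0 * lam * s_row[j])
    return S

def decode(S):
    """Read off the highest-valued nucleotide at each position."""
    return "".join(ALPHABET[max(range(4), key=lambda j: row[j])] for row in S)

# Hypothetical scorer that rewards one planted pattern (illustration only).
PLANTED = "TGACTCA"
W = [[1.0 if x == 1.0 else -1.0 for x in row] for row in one_hot(PLANTED)]
LAM = -0.01  # negative so that the norm term penalizes large inputs (assumption)

start = [[0.25] * 4 for _ in W]
S_opt = optimize_input(W, b=0.0, lam=LAM)
motif = decode(S_opt)
print(motif, objective(start, W, 0.0, LAM), objective(S_opt, W, 0.0, LAM))
```

In a real setting, `positive_prob` would be the positive-class output of the trained classifier and its gradient with respect to the input would come from backpropagation rather than the closed form used here.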

Deep Motif: Visualizing Genomic Sequence Classifications
