Deep Motif: Visualizing Genomic Sequence Classifications

Jack Lanchantin, Ritambhara Singh, Zeming Lin, Yanjun Qi
DOI: https://doi.org/10.48550/arXiv.1605.01133
2016-06-02
Abstract: This paper applies a deep convolutional/highway MLP framework to classify genomic sequences on the transcription factor binding site task. To make the model understandable, we propose an optimization driven strategy to extract "motifs", or symbolic patterns which visualize the positive class learned by the network. We show that our system, Deep Motif (DeMo), extracts motifs that are similar to, and in some cases outperform the current well known motifs. In addition, we find that a deeper model consisting of multiple convolutional and highway layers can outperform a single convolutional and fully connected layer in the previous state-of-the-art.
Machine Learning
What problem does this paper attempt to address?
This paper aims to accurately classify and interpret transcription factor binding sites (TFBSs) in DNA sequences using deep learning, in order to better understand biological processes and support biomedical research related to human health. Specifically, given a DNA sequence, the task is to determine whether it contains a binding site for a specific transcription factor (TF).

### Main Problem Background

1. **Importance of Transcription Factor Binding Sites**
   - Transcription factors (TFs) are regulatory proteins that bind to specific sites (TFBSs) on DNA to regulate cellular mechanisms.
   - Accurately identifying these binding sites is crucial for understanding gene expression regulation and related diseases.
2. **Limitations of Existing Technologies**
   - **ChIP-seq Experiments**: Although they can locate binding sites, they cannot reveal the patterns common to all positive binding sites, and they do little to explain why TFs bind at these positions.
   - **Traditional Methods**: Methods based on subset frequency counting, for example, may not generalize to unseen examples, and algorithms based on string kernel functions are limited by computational complexity.
3. **Requirements**
   - There is a need for large-scale computational methods that can not only classify binding sites accurately but also generate clear patterns (motifs) representing positive binding sites.

### Main Contributions of the Paper

1. **Proposing the Deep Motif (DeMo) Model**
   - Using a deep convolutional/highway multi-layer perceptron (highway MLP) network, it achieves higher TFBS classification accuracy than existing methods.
2. **Extracting Interpretable Visual Patterns (motifs)**
   - An optimization-driven strategy is proposed to extract "motifs", that is, symbolic patterns, from the model to visualize the positive class learned by the network.
   - The generated motifs are not only similar to known motifs but also, in some cases, superior to them.

### Technical Details

- **Deep Model Structure**
  - Input: raw nucleotide characters, one-hot encoded
  - Multiple convolutional layers (each with 128 filters of length 5), some followed by max-pooling of length 2
  - After max-pooling, the features feed into a fully connected highway network (highway MLP)
  - Output: a softmax over two classes (binary classification)
- **Motif Generation**
  - A motif is generated by optimizing the input sequence \(S\) to maximize its predicted probability of being a positive TFBS:

    \[ \arg\max_S P^+(S)+\lambda\|S\|_2^2 \]

  - Here \(P^+(S)\) is the probability that the input sequence \(S\) is a positive TFBS, and \(\lambda\) is a regularization parameter.

Through these methods, the paper not only improves classification accuracy but also provides an intuitive view of the patterns that characterize positive binding sites, which is helpful for biomedical research.
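
The motif-generation objective can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. A toy logistic scorer over a single linear filter stands in for the trained DeMo network, and the names `one_hot`, `positive_prob`, `optimize_motif`, `decode`, `TARGET`, `W`, and `B` are hypothetical. The sketch also assumes plain gradient ascent as the optimizer and a negative \(\lambda\), so that the squared-norm term acts as a penalty that keeps \(S\) bounded.

```python
import math

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values, one row per position."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq]

def positive_prob(S, W, b):
    """Toy stand-in for P^+(S): logistic score of a linear filter W (assumption)."""
    z = b + sum(w * s for w_row, s_row in zip(W, S) for w, s in zip(w_row, s_row))
    return 1.0 / (1.0 + math.exp(-z))

def objective(S, W, b, lam):
    """P^+(S) + lam * ||S||_2^2, as in the formula above."""
    sq_norm = sum(s * s for row in S for s in row)
    return positive_prob(S, W, b) + lam * sq_norm

def optimize_motif(W, b, lam=-0.01, lr=0.5, steps=200):
    """Gradient ascent on the objective, starting from an all-zero input matrix."""
    S = [[0.0] * len(ALPHABET) for _ in W]
    for _ in range(steps):
        p = positive_prob(S, W, b)
        for i, w_row in enumerate(W):
            for j, w in enumerate(w_row):
                # d/dS of sigmoid(z) is p * (1 - p) * w; d/dS of lam * S^2 is 2 * lam * S
                grad = p * (1.0 - p) * w + 2.0 * lam * S[i][j]
                S[i][j] += lr * grad
    return S

def decode(S):
    """Read the optimized matrix as a motif: the highest-scoring base per position."""
    return "".join(ALPHABET[max(range(len(ALPHABET)), key=lambda j: row[j])] for row in S)

# Hypothetical "trained" scorer: rewards TARGET bases (+1) and penalizes others (-1).
TARGET = "GATTACA"
W = [[1.0 if v == 1.0 else -1.0 for v in row] for row in one_hot(TARGET)]
B = -3.0

S_opt = optimize_motif(W, B)
print(decode(S_opt))
```

With a real trained network, `positive_prob` would be the model's softmax output for the positive class and the gradient would come from backpropagation to the input rather than from a closed-form expression.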