Convolutional Neural Networks for Regulatory Genomics

Niels ten Dijke, W. Kowalczyk, Gerard van Westen, D. Zerbino
Abstract: The majority of the human genome consists of sequences that do not code for a protein, so-called non-coding DNA. These non-coding regions nonetheless play a vital role in gene expression. They contain cis-regulatory elements such as promoters and enhancers, which can be bound by transcription factor proteins that thereby control the rate of transcription of DNA to messenger RNA. This in turn helps to regulate the expression of nearby genes. Next-generation sequencing (NGS) techniques make it possible to identify and study, at great sequencing depth, the genomic factors that underlie transcription, such as transcription factor binding, histone modifications and open chromatin. Furthermore, these data allow researchers to build predictive models for these events using machine learning approaches, which permit the annotation of new cell types without having to perform the experiment. In particular, convolutional neural networks seem to be well suited to modeling genomic data. A convolutional neural network (CNN) is a type of feed-forward neural network inspired by the animal visual cortex. CNNs are characterized by spatially local connections. This connectivity pattern makes CNNs effective on data that have a grid-like topology, in other words, data that can be represented by nodes connected to their neighbors along one or more dimensions, where neighboring elements have statistical dependencies. Recently, algorithmic advances, great improvements in processing capabilities and tools, and better datasets have made it possible to train increasingly complex models. Indeed, deep convolutional neural networks have proven to be very successful on many artificial intelligence tasks, such as image classification, learning policy and value functions for game-playing AI, and drug discovery.
Typical NGS data, which include DNA sequences, open chromatin and transcription factor binding data, are all one-dimensional grids. Identifying transcription factor binding sites can greatly help researchers understand the transcription process and the factors underlying genetic diseases. In the first experiment, convolutional neural network models were built to predict transcription factor binding sites from sequence, open chromatin, gene expression and DNA shape data. We found the convolutional neural network to perform close to the state of the art on some transcription factors, while performing significantly worse on others. Building a model for each task separately resulted in better predictive performance than a multi-task network modeling all transcription factors simultaneously. In the second experiment, we took a closer look at the transcription process. The exact location of transcription initiation, the transcription start site (TSS), can be determined experimentally at base pair resolution. Unlike translation, where the exact triplet that starts the process (the start codon) is known, transcription initiation is less well understood. We studied the transcription process by building a convolutional neural network to predict the exact positions of the transcription start sites. The trained models were then interpreted, which led to the finding that the area directly around the TSS is the most decisive factor in determining whether a particular base is a TSS, which to the best of our knowledge has not been reported in the literature.
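To make the "one-dimensional grid" view of DNA sequence concrete, the following minimal sketch one-hot encodes a sequence and slides a single convolutional filter along it, so that each output depends only on a few neighboring bases. It is an illustration under stated assumptions, not the authors' model: the example sequence, the "TATA" motif, the hand-set filter weights and the helper names `one_hot`, `conv1d`, `relu` and `motif_kernel` are all hypothetical, and a trained network would learn many such filters from data.

```python
# Illustrative sketch only: sequence, motif, weights and helper names are
# assumptions chosen for demonstration, not taken from the paper.

BASES = "ACGT"  # channel order of the one-hot encoding


def one_hot(sequence):
    """Encode a DNA string as rows of 4 channels (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES]
            for base in sequence.upper()]


def conv1d(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence, stride 1, no padding.

    Each output value depends only on `width` neighboring positions,
    which is the spatially local connectivity of a convolutional layer.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        outputs.append(total)
    return outputs


def relu(values):
    """Rectified linear activation: negative scores are clipped to zero."""
    return [max(0.0, v) for v in values]


def motif_kernel(motif):
    """Hand-set filter (hypothetical): +1 for the motif base, -1 otherwise."""
    return [[1.0 if b == base else -1.0 for b in BASES] for base in motif]


sequence = "GCGTATAAGC"
scores = relu(conv1d(one_hot(sequence), motif_kernel("TATA")))

# Position (0-based) of the window with the strongest filter response.
best_position = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_position)
```

Because the same filter weights are reused at every position, the filter responds to its motif wherever it occurs in the sequence, which is one reason convolutional layers are a natural fit for scanning for binding sites or per-base signals such as transcription start sites.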