Abstract

Background: Evolutionary conservation is an invaluable tool for inferring functional significance in the genome, including regions that are crucial across many species and those that have undergone convergent evolution. Computational methods to test for sequence conservation are dominated by algorithms that examine the ability of one or more nucleotides to align across large evolutionary distances. While these nucleotide alignment-based approaches have proven powerful for protein-coding genes and some non-coding elements, they fail to capture conservation of many enhancers, distal regulatory elements that control spatial and temporal patterns of gene expression. The function of enhancers is governed by a complex, often tissue- and cell type-specific code that links combinations of transcription factor binding sites and other regulation-related sequence patterns to regulatory activity. Thus, function of orthologous enhancer regions can be conserved across large evolutionary distances, even when nucleotide turnover is high.

Results: We present a new machine learning-based approach for evaluating enhancer conservation that leverages the combinatorial sequence code of enhancer activity rather than relying on the alignment of individual nucleotides. We first train a convolutional neural network model that can predict tissue-specific open chromatin, a proxy for enhancer activity, across mammals. Next, we apply that model to distinguish instances where the genome sequence would predict conserved function versus a loss of regulatory activity in that tissue. We present criteria for systematically evaluating model performance for this task and use them to demonstrate that our models accurately predict tissue-specific conservation and divergence in open chromatin between primate and rodent species, vastly out-performing leading nucleotide alignment-based approaches. We then apply our models to predict open chromatin at orthologs of brain and liver open chromatin regions across hundreds of mammals and find that brain enhancers associated with neuron activity have a stronger tendency than the general population to have predicted lineage-specific open chromatin.

Conclusion: The framework presented here provides a mechanism to annotate tissue-specific regulatory function across hundreds of genomes and to study enhancer evolution using predicted regulatory differences rather than nucleotide-level conservation measurements.
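
The two-step idea in the Results (score a sequence for open chromatin, then compare scores between orthologs) can be illustrated with a toy sketch. This is not the authors' model: the single hand-written filter, the motif "TGAC", the weights, the 0.5 threshold, and all function names are hypothetical choices made only to show the shape of a convolution, max-pooling, and logistic-output pipeline. A real model would learn many filters from data.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode DNA as rows of 4 values (A, C, G, T); unknown bases such as N become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def motif_filter(motif):
    """Hypothetical hand-written filter: +1 for the motif base at each position, -1 otherwise."""
    return [[1.0 if b == base else -1.0 for b in BASES] for base in motif]

def conv_max_score(encoded, filt):
    """Slide one filter along the sequence and return the maximum activation (global max pooling)."""
    width = len(filt)
    best = -math.inf
    for start in range(len(encoded) - width + 1):
        act = sum(encoded[start + i][j] * filt[i][j]
                  for i in range(width) for j in range(4))
        best = max(best, act)
    return best

def predict_open(seq, filters, weights, bias):
    """Toy predictor: convolution -> max pool -> ReLU -> logistic output in (0, 1)."""
    enc = one_hot(seq)
    feats = [max(0.0, conv_max_score(enc, f)) for f in filters]
    z = bias + sum(w * x for w, x in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))

def compare_orthologs(p_ref, p_query, threshold=0.5):
    """Label an ortholog pair from two predicted probabilities (illustrative rule, assumed threshold)."""
    ref_open, query_open = p_ref >= threshold, p_query >= threshold
    if ref_open and query_open:
        return "predicted conserved"
    if ref_open and not query_open:
        return "predicted loss"
    if query_open:
        return "predicted gain"
    return "predicted closed in both"

# Assumed toy parameters; nothing here is fitted to data.
FILTERS = [motif_filter("TGAC")]
WEIGHTS = [2.0]
BIAS = -5.0

p_ref = predict_open("AATGACGG", FILTERS, WEIGHTS, BIAS)    # contains the motif
p_query = predict_open("AATGTCGG", FILTERS, WEIGHTS, BIAS)  # one substitution in the motif
print(compare_orthologs(p_ref, p_query))
```

Because the comparison uses predicted activity rather than aligned nucleotides, a query sequence in which the motif has moved to a different position would still score as open in this sketch, which is the behaviour that alignment-based measures tend to miss.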

What can we learn from noncoding regions of similarity between genomes?

Systematic Identification of Conserved Motif Modules in the Human Genome

Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties

Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals

Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin

Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals

Eukaryotic Regulatory Element Conservation Analysis and Identification Using Comparative Genomics

Image-based Promoter Prediction: a Promoter Prediction Method Based on Evolutionarily Generated Patterns.

Wide-Scale Analysis of Human Functional Transcription Factor Binding Reveals a Strong Bias towards the Transcription Start Site

Prediction of Prokaryotic and Eukaryotic Promoters Using Convolutional Deep Learning Neural Networks

A Bag-Of-Motif Model Captures Cell States at Distal Regulatory Sequences

Fundamentals for predicting transcriptional regulations from DNA sequence patterns

Genome-wide comparative analysis reveals human-mouse regulatory landscape and evolution

It's not junk!: the search for functional elements in noncoding DNA

Genome-wide analysis of mammalian promoter architecture and evolution

Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks

GPMiner: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group

A Pattern-Based Nearest Neighbor Search Approach for Promoter Prediction Using DNA Structural Profiles

Flawed machine-learning confounds coding sequence annotation

Understanding Distal Transcriptional Regulation from Sequence Motif, Network Inference and Interactome Perspectives

Predicting unrecognized enhancer-mediated genome topology by an ensemble machine learning model