Enhancer-MDLF: a novel deep learning framework for identifying cell-specific enhancers

Yao Zhang, Pengyu Zhang, Hao Wu
DOI: https://doi.org/10.1093/bib/bbae083
IF: 9.5
2024-01-22
Briefings in Bioinformatics
Abstract: Enhancers, noncoding DNA fragments, play a pivotal role in gene regulation, facilitating gene transcription. Identifying enhancers is crucial for understanding genomic regulatory mechanisms, pinpointing key elements and investigating networks governing gene expression and disease-related mechanisms. Existing enhancer identification methods exhibit limitations, prompting the development of our novel multi-input deep learning framework, termed Enhancer-MDLF. Experimental results illustrate that Enhancer-MDLF outperforms the previous method, Enhancer-IF, across eight distinct human cell lines and exhibits superior performance on generic enhancer datasets and enhancer–promoter datasets, affirming the robustness of Enhancer-MDLF. Additionally, we introduce transfer learning to provide an effective and potential solution to address the prediction challenges posed by enhancer specificity. Furthermore, we utilize model interpretation to identify transcription factor binding site motifs that may be associated with enhancer regions, with important implications for facilitating the study of enhancer regulatory mechanisms. The source code is openly accessible at https://github.com/HaoWuLab-Bioinformatics/Enhancer-MDLF.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The main objective of this paper is to propose a new deep learning framework, called Enhancer-MDLF (Multi-input Deep Learning Framework), for identifying cell-specific enhancers. Specifically, the paper aims to address the following key issues:

1. **Limitations of existing methods**: Existing experimental approaches to enhancer identification include methods based on conserved sequence and transcription factor binding site data, methods using ChIP-seq data, methods relying on chromatin accessibility-related data, and methods using histone modification data or enhancer RNA (eRNA) data. These methods either have a high false positive rate, cannot distinguish enhancers from promoter regions, or are limited in predicting enhancers with inactive transcription.
2. **Need for computational tools**: Because experimental methods are time-consuming and costly, reliable computational tools are needed to identify enhancers.
3. **Issues with existing computational methods**: Although several computational methods have been proposed for enhancer identification, they are usually based on a general dataset containing 9 different cell lines, which overlooks the cell specificity of enhancers. Additionally, these methods may perform poorly when handling sequences of unequal lengths and require significant time for parameter optimization when applied to new cell lines.
4. **Improving existing frameworks**: The Enhancer-IF framework mentioned in the paper considers cell specificity, but its predictive performance still needs improvement, and its model lacks interpretability, making it difficult to explore the role of transcription factor binding sites (TFBS) in enhancer regions.

To address the above challenges, the paper proposes Enhancer-MDLF, a multi-input deep learning framework that combines word vector features of human genome sequences and motif features extracted from position weight matrices (PWM).
Comprehensive evaluations on various datasets demonstrate that Enhancer-MDLF has significant advantages over previous methods, particularly in cell-specific enhancer prediction. Additionally, the framework introduces transfer learning to address cross-cell line prediction challenges brought by enhancer specificity and provides model interpretability to identify the most important TFBS motifs within enhancer regions.
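
To make the two input types named above more concrete, here is a minimal Python sketch of (a) splitting a DNA sequence into overlapping k-mer "words", a common preprocessing step before learning word vectors, and (b) scoring a sequence against a log-odds position weight matrix. This is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of k, the pseudocount, the uniform background, and the toy count matrix are all hypothetical, and the summary does not specify which tokenization or PWM scoring scheme Enhancer-MDLF actually uses.

```python
import math

BASES = "ACGT"


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers ("words").

    Assumption: overlapping k-mers with stride 1; the paper's exact
    tokenization is not described in this summary.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log-odds PWM.

    Each column becomes log2(p(base) / background), with a pseudocount
    so that unseen bases do not produce log(0).
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def best_pwm_score(seq, pwm):
    """Return the highest log-odds score over all windows of the sequence.

    Bases outside A/C/G/T (e.g. N) contribute a neutral score of 0.
    """
    seq = seq.upper()
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(pwm[j].get(seq[i + j], 0.0) for j in range(width))
        for i in range(len(seq) - width + 1)
    )


# Hypothetical 3-bp motif that strongly prefers "ACG" (toy counts, not real data).
toy_counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
toy_pwm = pwm_from_counts(toy_counts)

words = kmer_tokens("ACGTA", k=3)
motif_score = best_pwm_score("TTACGTT", toy_pwm)
```

In a multi-input model of the kind described, the k-mer words would typically be mapped to learned embeddings on one branch while per-motif scores like `motif_score` form a feature vector on another; how the two branches are merged in Enhancer-MDLF is not covered by this sketch.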