DARDN: A Deep-Learning Approach for CTCF Binding Sequence Classification and Oncogenic Regulatory Feature Discovery

Hyun Jae Cho, Zhenjia Wang, Yidan Cong, Stefan Bekiranov, Aidong Zhang, Chongzhi Zang
DOI: https://doi.org/10.3390/genes15020144
IF: 4.141
2024-01-24
Genes
Abstract: Characterization of gene regulatory mechanisms in cancer is a key task in cancer genomics. CCCTC-binding factor (CTCF), a DNA binding protein, exhibits specific binding patterns in the genome of cancer cells and has a non-canonical function to facilitate oncogenic transcription programs by cooperating with transcription factors bound at flanking distal regions. Identification of DNA sequence features from a broad genomic region that distinguish cancer-specific CTCF binding sites from regular CTCF binding sites can help find oncogenic transcription factors in a cancer type. However, the presence of long DNA sequences without localization information makes it difficult to perform conventional motif analysis. Here, we present DNAResDualNet (DARDN), a computational method that utilizes convolutional neural networks (CNNs) for predicting cancer-specific CTCF binding sites from long DNA sequences and employs DeepLIFT, a method for interpretability of deep learning models that explains the model's output in terms of the contributions of its input features. The method is used for identifying DNA sequence features associated with cancer-specific CTCF binding. Evaluation on DNA sequences associated with CTCF binding sites in T-cell acute lymphoblastic leukemia (T-ALL) and other cancer types demonstrates DARDN's ability in classifying DNA sequences surrounding cancer-specific CTCF binding from control constitutive CTCF binding and identifying sequence motifs for transcription factors potentially active in each specific cancer type. We identify potential oncogenic transcription factors in T-ALL, acute myeloid leukemia (AML), breast cancer (BRCA), colorectal cancer (CRC), lung adenocarcinoma (LUAD), and prostate cancer (PRAD). Our work demonstrates the power of advanced machine learning and feature discovery approach in finding biologically meaningful information from complex high-throughput sequencing data.
genetics & heredity
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses how to identify DNA sequence features associated with cancer-specific CTCF binding sites from long DNA sequences. Specifically, the researchers aim to distinguish cancer-specific CTCF binding sites from regular CTCF binding sites and, from the distinguishing features, to discover oncogenic transcription factors active in specific cancer types.

### Background and challenges

1. **Gene regulation mechanism**: Characterizing gene regulation mechanisms is a key task in cancer genomics. CTCF (CCCTC-binding factor) is a DNA-binding protein that exhibits specific binding patterns in the genomes of cancer cells and promotes oncogenic transcription programs by cooperating with transcription factors bound at flanking distal regions.
2. **Limitations of traditional methods**: Traditional transcription factor (TF) motif search methods do not apply here: the position of the target oncogenic TF binding site relative to the cancer-specific CTCF site is unknown and may be far away, the search space is huge, and appropriate control sequences are lacking.
3. **Data imbalance**: Cancer-specific CTCF binding sites are usually far fewer than regular CTCF binding sites. This class imbalance makes it difficult for traditional machine-learning methods to classify the sequences effectively.

### Solutions

To address these challenges, the researchers propose DNAResDualNet (DARDN), a method based on deep convolutional neural networks (CNNs) that predicts cancer-specific CTCF binding sites and interprets the model's output with DeepLIFT, thereby identifying DNA sequence features associated with cancer-specific CTCF binding.

### Main contributions

1. **Model design**: DARDN uses two CNN models with different initial kernel sizes and introduces residual connections to improve classification accuracy on long DNA sequences.
2. **Data augmentation**: Reverse-complement and random-shift augmentation are applied to alleviate the data imbalance problem.
3. **Feature interpretation**: DeepLIFT is used to interpret the model's output and identify the DNA sequence features that contribute most to the classification results.
4. **Application verification**: The method was evaluated on data from T-ALL and other cancer types, demonstrating that DARDN can classify the DNA sequences and identify oncogenic transcription factors.

### Conclusion

This study demonstrates that deep learning combined with feature discovery can extract biologically meaningful information from complex high-throughput sequencing data, providing new tools for understanding the mechanisms of cancer development.
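
To make the data augmentation contribution more concrete, here is a minimal Python sketch of reverse-complement and random-shift augmentation for DNA sequence windows. The summary names only the two techniques, so the function names, the window length, the shift range, and the clamping behavior at region edges are all assumptions made for illustration, not the authors' implementation.

```python
import random

# Complement table for the four DNA bases plus N (unknown base).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def random_shift_window(region: str, center: int, window: int,
                        max_shift: int, rng: random.Random) -> str:
    """Cut a fixed-length window whose centre is randomly offset.

    `center` is the index of the binding site within `region`;
    `window` and `max_shift` are hypothetical parameters.
    """
    if window > len(region):
        raise ValueError("window must not be longer than the region")
    shift = rng.randint(-max_shift, max_shift)
    start = center + shift - window // 2
    # Clamp so the window always stays inside the region (an assumption).
    start = max(0, min(start, len(region) - window))
    return region[start:start + window]


def augment(region: str, center: int, window: int,
            max_shift: int, n_copies: int, seed: int = 0) -> list:
    """Return 2 * n_copies sequences: each shifted window, then its reverse complement."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_copies):
        shifted = random_shift_window(region, center, window, max_shift, rng)
        samples.append(shifted)
        samples.append(reverse_complement(shifted))
    return samples


if __name__ == "__main__":
    toy_region = "ACGTTGCAAGGCTTACCGGATTCA"  # toy sequence, not real data
    for s in augment(toy_region, center=12, window=8, max_shift=3, n_copies=2):
        print(s)
```

Applying such augmentation only to the minority class (here, the cancer-specific sites) is one common way to reduce class imbalance. Whether DARDN augments one class or both is not stated in this summary.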