Abstract: Characterization of gene regulatory mechanisms in cancer is a key task in cancer genomics. CCCTC-binding factor (CTCF), a DNA-binding protein, exhibits specific binding patterns in the genome of cancer cells and has a non-canonical function of facilitating oncogenic transcription programs by cooperating with transcription factors bound at flanking distal regions. Identifying DNA sequence features from a broad genomic region that distinguish cancer-specific CTCF binding sites from regular CTCF binding sites can help find oncogenic transcription factors in a cancer type. However, the presence of long DNA sequences without localization information makes it difficult to perform conventional motif analysis. Here, we present DNAResDualNet (DARDN), a computational method that utilizes convolutional neural networks (CNNs) for predicting cancer-specific CTCF binding sites from long DNA sequences and employs DeepLIFT, a method for interpretability of deep learning models that explains the model's output in terms of the contributions of its input features. The method is used for identifying DNA sequence features associated with cancer-specific CTCF binding. Evaluation on DNA sequences associated with CTCF binding sites in T-cell acute lymphoblastic leukemia (T-ALL) and other cancer types demonstrates DARDN's ability to distinguish DNA sequences surrounding cancer-specific CTCF binding from those surrounding control constitutive CTCF binding, and to identify sequence motifs for transcription factors potentially active in each specific cancer type. We identify potential oncogenic transcription factors in T-ALL, acute myeloid leukemia (AML), breast cancer (BRCA), colorectal cancer (CRC), lung adenocarcinoma (LUAD), and prostate cancer (PRAD). Our work demonstrates the power of advanced machine learning and feature discovery approaches in finding biologically meaningful information from complex high-throughput sequencing data.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses how to identify DNA sequence features related to cancer-specific CTCF binding sites from long DNA sequences. Specifically, the researchers want to distinguish cancer-specific CTCF binding sites from regular CTCF binding sites and, from the distinguishing features, discover oncogenic transcription factors related to specific cancer types.

### Background and challenges

1. **Gene regulation mechanism**: In cancer genomics, characterizing gene regulation mechanisms is a crucial task. CTCF (CCCTC-binding factor) is a DNA-binding protein that exhibits specific binding patterns in the genomes of cancer cells and promotes oncogenic transcription programs by cooperating with transcription factors bound at flanking distal regions.
2. **Limitations of traditional methods**: Traditional TF (transcription factor) motif search methods are not applicable here because the position of the target oncogenic TF binding site relative to the cancer-specific CTCF site is unknown and may be very far away, the search space is huge, and appropriate control sequences are lacking.
3. **Data imbalance**: Cancer-specific CTCF binding sites are usually far fewer than regular CTCF binding sites. This class imbalance makes it difficult for traditional machine-learning methods to classify effectively.

### Solutions

To address the above challenges, the researchers proposed DNAResDualNet (DARDN), a method based on deep convolutional neural networks (CNNs) that predicts cancer-specific CTCF binding sites and interprets the model's output with DeepLIFT, thereby identifying DNA sequence features related to cancer-specific CTCF binding.

### Main contributions

1. **Model design**: DARDN utilizes two CNN models with different initial kernel sizes and introduces residual connections to improve classification accuracy on long DNA sequences.
2. **Data augmentation**: Data augmentation through reverse complement and random shift is used to alleviate the data imbalance problem.
3. **Feature interpretation**: DeepLIFT is used to interpret the model's output and identify DNA sequence features that contribute significantly to the classification results.
4. **Application verification**: The method was evaluated on data from T-ALL and other cancer types, demonstrating the effectiveness of DARDN in classifying DNA sequences and identifying oncogenic transcription factors.

### Conclusion

This study demonstrates the ability of deep-learning and feature-discovery methods to extract biologically meaningful information from complex high-throughput sequencing data, providing new tools for understanding the mechanisms of cancer development.
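
The data augmentation named under the main contributions (reverse complement and random shift) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the window length, the shift range, and the number of copies are all hypothetical choices made for illustration, since the summary does not state the values DARDN uses.

```python
import random

# Complement map for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def random_shift(region: str, window: int, max_shift: int,
                 rng: random.Random) -> str:
    """Cut a fixed-length window from `region`, offset from the centred
    position by a random number of bases in [-max_shift, max_shift].

    All parameters are illustrative assumptions, not values from the paper.
    """
    centre_start = (len(region) - window) // 2
    if centre_start < max_shift:
        raise ValueError("region too short for the requested shift")
    start = centre_start + rng.randint(-max_shift, max_shift)
    return region[start:start + window]


def augment(region: str, window: int, max_shift: int,
            n_copies: int, seed: int = 0) -> list[str]:
    """Produce 2 * n_copies sequences for one minority-class region:
    each randomly shifted window plus its reverse complement."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_copies):
        shifted = random_shift(region, window, max_shift, rng)
        augmented.append(shifted)
        augmented.append(reverse_complement(shifted))
    return augmented
```

Applying something like `augment` only to the minority class (the cancer-specific sites) is one plausible way such augmentation could reduce the imbalance against the larger set of regular CTCF sites. Whether DARDN augments one class or both is not stated in this summary.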

DARDN: A Deep-Learning Approach for CTCF Binding Sequence Classification and Oncogenic Regulatory Feature Discovery
