Abstract:

Background: The data deluge makes it possible to apply sophisticated ML techniques to the functional annotation of the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model's complexity. In our system, AIKYATAN, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given the combinatorial signatures of their histone modifications. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes).

Results: We develop a suite of ML models, under the banner AIKYATAN, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, that deep learning approaches have a computational advantage. Moreover, convolutional neural networks (CNN) provide the best-in-class accuracy, superior to the vanilla variant. On the human embryonic cell line H1, the CNN achieves an accuracy of 97.9% and an order of magnitude lower runtime than the kernel SVM. Running on a GPU, training is sped up 21x and 30x (over CPU) for DNN and CNN, respectively. Finally, our CNN model shows superior prediction performance vis-à-vis the competition. Specifically, AIKYATAN-CNN achieved a 40% higher validation rate than CSIANN and the same accuracy as RFECS.

Conclusions: Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can also scale with increasing data volumes and diversity. In addition, a subset of these datasets has image-like properties and benefits from spatial pooling of features. Our AIKYATAN suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.
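
To make the idea of convolution followed by spatial pooling over histone-modification signals concrete, here is a minimal pure-Python sketch. It is not the AIKYATAN implementation: the input layout (one row per histone mark, one column per genomic bin), the toy values, the hand-picked `peak_kernel`, and all function names are assumptions made for illustration. A trained CNN would learn its filters and would typically combine marks as channels rather than filtering each track separately.

```python
# Illustrative sketch only; not the AIKYATAN code.
# Assumed input layout: rows = histone marks, columns = genomic bins.

def conv1d_valid(signal, kernel):
    """Slide `kernel` along `signal` without padding and return the dot products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def relu(values):
    """Rectified linear activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool(values, width):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(values[i:i + width]) for i in range(0, len(values), width)]

def extract_features(window, kernel, pool_width):
    """Apply the same filter to each mark's track, then ReLU and max pooling."""
    features = []
    for track in window:
        features.extend(max_pool(relu(conv1d_valid(track, kernel)), pool_width))
    return features

# Hypothetical toy window: 2 marks x 6 bins of read-count-like values.
window = [
    [0.0, 1.0, 3.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0, 1.0, 4.0],
]
peak_kernel = [-1.0, 2.0, -1.0]  # hand-picked filter that responds to a local peak
features = extract_features(window, peak_kernel, pool_width=2)
print(features)
```

The pooled features keep the strongest filter response in each neighborhood, so a peak shifted by one bin yields a similar feature vector. This is one plausible reading of why image-like epigenomic signals would benefit from spatial pooling; in a full classifier these features would feed a final layer that outputs the regulatory or non-regulatory label.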

CNN-BLSTM based deep learning framework for eukaryotic kinome classification: An explainability based approach

ProtienCNN‐BLSTM: An efficient deep neural network with amino acid embedding‐based model of protein sequence classification and biological analysis

BBATProt: A Framework Predicting Biological Function with Enhanced Feature Extraction via Explainable Deep Learning

Using explainable machine learning to uncover the kinase–substrate interaction landscape

Optimizing protein sequence classification: integrating deep learning models with Bayesian optimization for enhanced biological analysis

Explaining Black-box Models for Biomedical Text Classification

AIKYATAN: mapping distal regulatory elements using convolutional learning on GPU

An Interpretable Convolutional Neural Network Framework for Analyzing Molecular Dynamics Trajectories: a Case Study on Functional States for G-Protein-Coupled Receptors

Phosformer: an explainable transformer model for protein kinase-specific phosphorylation predictions

Unveiling Black-boxes: Explainable Deep Learning Models for Patent Classification

Distangling Biological Noise in Cellular Images with a focus on Explainability

Deep learning-based proteomics enables accurate classification of bulk and single-cell samples

Unveiling Molecular Moieties through Hierarchical Graph Explainability

Identification of Protein Lysine Crotonylation Sites by a Deep Learning Framework with Convolutional Neural Networks

Biophysical models of cis-regulation as interpretable neural networks

Pathologist-Like Explanations Unveiled: an Explainable Deep Learning System for White Blood Cell Classification

Explaining Deep Convolutional Neural Networks for Image Classification by Evolving Local Interpretable Model-agnostic Explanations

DeepSite: bidirectional LSTM and CNN models for predicting DNA–protein binding

Deep Learning Methods for Protein Family Classification on PDB Sequencing Data

Knowledge-Based Analysis for Detecting Key Signaling Events from Time-Series Phosphoproteomics Data

The Development and Application of KinomePro-DL: A Deep Learning Based Online Small Molecule Kinome Selectivity Profiling Prediction Platform