Deep Learning-Based Classification of CRISPR Loci Using Repeat Sequences

Xingyu Liao, Yanyan Li, Yingfu Wu, Xingyi Li, Xuequn Shang

DOI: https://doi.org/10.1101/2024.06.27.601093

2024-07-01

Abstract: With the widespread application of the CRISPR-Cas system in gene editing and related fields, the demand for detecting and classifying CRISPR-Cas systems in metagenomic data has continuously increased. The traditional classification of the CRISPR-Cas system mainly relies on identifying neighboring cas genes of repeats. However, in some cases where there is a lack of information about cas genes, such as in metagenomes and fragmented genome assemblies, traditional classification methods may become ineffective. Here, we introduce a deep learning-based method called CRISPRclassify-CNN-Att, which classifies CRISPR-Cas systems solely based on repeat sequences. CRISPRclassify-CNN-Att utilizes convolutional neural networks (CNNs) and self-attention mechanisms to extract features from repeat sequences. It employs a stacking strategy to handle sample imbalances across different subtypes and improves classification accuracy for subtypes with fewer samples through transfer learning. CRISPRclassify-CNN-Att demonstrates excellent performance in classifying multiple subtypes, particularly in subtypes with a larger number of samples. Although CRISPR loci classification primarily relies on cas genes, CRISPRclassify-CNN-Att offers a new approach as a significant complement to current methods. It can identify unclassified loci missed by traditional cas-based methods, breaking the limitations of traditional approaches, and simplifying the classification process. The proposed tool is freely accessible via https://github.com/Xingyu-Liao/CRISPRclassify-CNN-Att.
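To make the combination of convolution and self-attention over a repeat sequence more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the one-hot encoding, the single hand-written convolution filter, the identity query/key/value projections, and the toy sequences are all assumptions chosen for illustration, and a real model would learn its filters and projections from data.

```python
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols map to all-zero rows


def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot rows."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d_relu(x, kernel):
    """Slide one filter (width x channels) over the sequence, then apply ReLU.

    x is length x channels; the result has length - width + 1 entries.
    """
    width, channels = len(kernel), len(kernel[0])
    out = []
    for i in range(len(x) - width + 1):
        s = sum(x[i + j][c] * kernel[j][c]
                for j in range(width) for c in range(channels))
        out.append(max(0.0, s))
    return out


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections (an assumption)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(weights[j] * x[j][c] for j in range(len(x)))
                    for c in range(d)])
    return out


# Toy usage: a filter that fires on the motif "GTT" in a made-up sequence.
motif_filter = one_hot("GTT")
activations = conv1d_relu(one_hot("ACGTTG"), motif_filter)
attended = self_attention(one_hot("ACGTTG"))
```

In this sketch the convolution acts as a motif detector (its activation is highest where the window matches the filter), while self-attention lets every position mix in information from all other positions of the repeat.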

Bioinformatics

What problem does this paper attempt to address?

This paper presents CRISPRclassify-CNN-Att, a deep learning method that classifies CRISPR-Cas systems based solely on repeat sequences. Traditional classification methods mainly rely on identifying cas genes near the repeat sequences, so they may fail when cas gene information is lacking (e.g., in metagenomic data or fragmented genome assemblies). CRISPRclassify-CNN-Att uses convolutional neural networks (CNNs) and self-attention mechanisms to extract features from the repeat sequences, and employs a stacking strategy to address the sample imbalance among different subtypes. Transfer learning is used to improve classification accuracy for subtypes with few samples. The method performs well across multiple subtypes, especially those with a larger number of samples. The paper points out that although the classification of CRISPR loci depends primarily on cas genes, CRISPRclassify-CNN-Att complements current methods by identifying unclassified loci that cas-based methods may miss, which also simplifies the classification process. The tool is publicly available on GitHub. The study found that features such as repeat sequences, k-mer frequencies, GC content, and sequence length significantly affect the model's performance. Analysis of key k-mers revealed subtype-specific sequence patterns, which contributes to understanding the function and mechanism of CRISPR-Cas systems.
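The sequence-derived features mentioned above (k-mer frequencies, GC content, and sequence length) can be computed with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: the function names, the choice of k = 3, the feature ordering, and the toy repeat string are hypothetical and are not taken from the CRISPRclassify-CNN-Att code base.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of every A/C/G/T k-mer in seq.

    Windows containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def repeat_features(seq, k=3):
    """Concatenate k-mer frequencies (sorted by k-mer), GC content, and length."""
    freqs = kmer_frequencies(seq, k)
    return [freqs[km] for km in sorted(freqs)] + [gc_content(seq), float(len(seq))]


# Toy usage with a made-up repeat-like sequence (hypothetical, for illustration only).
toy_repeat = "GTTTTAGAGCTATGCTGTTTTG"
features = repeat_features(toy_repeat, k=3)  # 4**3 k-mer frequencies + 2 scalars
```

A fixed-length vector like this could be fed to a conventional classifier or combined with learned CNN features; how the original tool combines them is not specified in the text above.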
