Deep Learning-Based Classification of CRISPR Loci Using Repeat Sequences

Xingyu Liao, Yanyan Li, Yingfu Wu, Xingyi Li, Xuequn Shang
DOI: https://doi.org/10.1101/2024.06.27.601093
2024-07-01
Abstract: With the widespread application of the CRISPR-Cas system in gene editing and related fields, the demand for detecting and classifying CRISPR-Cas systems in metagenomic data has continuously increased. The traditional classification of the CRISPR-Cas system mainly relies on identifying neighboring cas genes of repeats. However, in some cases where there is a lack of information about cas genes, such as in metagenomes and fragmented genome assemblies, traditional classification methods may become ineffective. Here, we introduce a deep learning-based method called CRISPRclassify-CNN-Att, which classifies CRISPR-Cas systems solely based on repeat sequences. CRISPRclassify-CNN-Att utilizes convolutional neural networks (CNNs) and self-attention mechanisms to extract features from repeat sequences. It employs a stacking strategy to handle sample imbalances across different subtypes and improves classification accuracy for subtypes with fewer samples through transfer learning. CRISPRclassify-CNN-Att demonstrates excellent performance in classifying multiple subtypes, particularly in subtypes with a larger number of samples. Although CRISPR loci classification primarily relies on cas genes, CRISPRclassify-CNN-Att offers a new approach as a significant complement to current methods. It can identify unclassified loci missed by traditional cas-based methods, breaking the limitations of traditional approaches, and simplifying the classification process. The proposed tool is freely accessible via https://github.com/Xingyu-Liao/CRISPRclassify-CNN-Att .
Bioinformatics
What problem does this paper attempt to address?
This paper presents CRISPRclassify-CNN-Att, a deep learning method that classifies CRISPR-Cas systems from repeat sequences alone. Traditional classification methods rely mainly on identifying cas genes near the repeats, so they can fail when cas gene information is missing, for example in metagenomic data or fragmented genome assemblies. CRISPRclassify-CNN-Att uses convolutional neural networks (CNNs) and self-attention mechanisms to extract features from repeat sequences, applies a stacking strategy to handle sample imbalance across subtypes, and uses transfer learning to improve classification accuracy for subtypes with few samples. The method performs well across multiple subtypes, especially those with a larger number of samples. The paper notes that although CRISPR locus classification depends primarily on cas genes, CRISPRclassify-CNN-Att complements current methods by identifying unclassified loci that cas-based methods may miss, which overcomes a limitation of those methods and simplifies the classification process. The tool is publicly available on GitHub. The study reports that features such as the repeat sequence itself, k-mer frequencies, GC content, and sequence length significantly affect the model's performance. Analysis of key k-mers revealed subtype-specific sequence patterns, which contributes to understanding the function and mechanisms of CRISPR-Cas systems. Furthermore, the model can handle cases where cas gene information is unavailable, which improves classification efficiency and accuracy in those settings.
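The repeat-derived features named in the summary (k-mer frequencies, GC content, and sequence length) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k = 3, the handling of ambiguous bases, and the example sequence are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGT"


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """Relative frequency of every possible k-mer over A/C/G/T.

    Windows containing other characters (e.g. N) are skipped; this is an
    assumption of the sketch, not something stated in the paper.
    """
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if set(w) <= set(ALPHABET))
    total = sum(counts.values())
    # Return a fixed-size vector (4**k entries) so every repeat maps to
    # the same feature layout, including k-mers that never occur.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def repeat_features(seq: str, k: int = 3) -> dict:
    """Bundle the three hand-crafted features for one repeat sequence."""
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "kmer_freqs": kmer_frequencies(seq, k),
    }


if __name__ == "__main__":
    # Arbitrary example sequence, used only to exercise the functions.
    example = "GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC"
    feats = repeat_features(example, k=3)
    print(feats["length"], round(feats["gc_content"], 3))
```

Returning a fixed-length k-mer vector keeps the features compatible with a downstream classifier regardless of repeat length. How the published model actually combines such features with the CNN and self-attention representations is not specified in this summary.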