Discovering CRISPR-Cas system with self-processing pre-crRNA capability by foundation models

Wenhui Li, Xianyue Jiang, Wuke Wang, Liya Hou, Runze Cai, Yongqian Li, Qiuxi Gu, Guohui Chuai, Qinchang Chen, Peixiang Ma, Jin Tang, Menghao Guo, Xingxu Huang, Jun Zhang, Qi Liu
DOI: https://doi.org/10.1101/2024.03.11.583506
2024-03-11
Abstract: The discovery and functional annotation of CRISPR-Cas systems laid the groundwork for the development of novel CRISPR-based gene editing tools. Traditional similarity-search-based Cas discovery strategies, which rely heavily on local sequence alignment and reference Cas homologs, may overlook a significant number of remote homologs with limited sequence similarity, and they cannot be applied directly to functional recognition. With the rapid development of protein large language models (LLMs), protein foundation models are expected to help model Cas systems that have few known homologs, without extensive task-specific training data; however, the full potential of these models for Cas discovery and functional annotation has yet to be determined. To this end, we present a novel, effective and unified AI framework, CHOOSER (Cas HOmlog Observing and SElf-processing scReening), which uses protein foundation models for alignment-free discovery of novel CRISPR-Cas systems with self-processing precursor CRISPR RNA (pre-crRNA) capability. CHOOSER successfully retrieved 11 novel homologs of Casλ, the majority of which are predicted to be able to self-process pre-crRNA, nearly doubling the current catalog. One of the candidates, EphcCasλ, was subsequently validated experimentally for its ability to self-process pre-crRNA, cleave target DNA, and perform trans-cleavage, and was shown to be a promising candidate for use as a CRISPR-Cas-based pathogen detection system. Overall, our study provides an unprecedented perspective and methodology for discovering novel CRISPR-Cas systems with specific functions using foundation models, underscoring the potential for transforming newly identified Cas homologs into genetic editing tools.
Bioinformatics