Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences

Jeffrey A. Ruffolo, Stephen Nayfach, Joseph Gallagher, Aadyot Bhatnagar, Joel Beazer, Riffat Hussain, Jordan Russ, Jennifer Yip, Emily Hill, Martin Pacesa, Alexander J. Meeske, Peter Cameron, Ali Madani
DOI: https://doi.org/10.1101/2024.04.22.590591
2024-04-22
Abstract: Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and meta-genomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, we demonstrate an AI-generated gene editor, denoted as OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.
Synthetic Biology
What problem does this paper attempt to address?
The paper asks how artificial intelligence, especially large language models (LLMs), can be used to design gene editors that overcome the functional tradeoffs existing CRISPR-Cas systems show in non-native environments such as human cells. Specifically, the goals of the paper include:

1. **Generating diverse CRISPR-Cas proteins**: Use large-scale data mining and machine learning to generate a large number of novel CRISPR-Cas proteins that differ significantly in sequence from natural proteins but remain functional.
2. **Improving the activity and specificity of gene editors**: Design gene editors whose activity and specificity in human cells are comparable to or better than those of SpCas9, and which are also compatible with other functions such as base editing.
3. **Validating the functionality of generated gene editors**: Experimentally measure the editing performance of the generated gene editors in human cells to confirm their efficiency and specificity at different targets.

The paper achieves these goals through the following steps:

1. **Data collection and preprocessing**: Mining over 1 million CRISPR-Cas operons from 26 terabases of assembled genomes and metagenomes, constructing a CRISPR-Cas map.
2. **Model training and generation**: Training large language models (LLMs) on the CRISPR-Cas map and using them to generate 4 million CRISPR-Cas protein sequences.
3. **Sequence classification and screening**: Classifying and screening the generated sequences with BLAST and HMMs to confirm that they belong to specific CRISPR-Cas families.
4. **Structure prediction and functional validation**: Predicting the structure of the generated proteins with AlphaFold2 and experimentally validating their editing efficiency and specificity in human cells.

Ultimately, the paper demonstrates that the generated gene editor OpenCRISPR-1 exhibits high activity and specificity in human cells, providing new possibilities for the development of gene editing technology.
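
The abstract describes generated editors as being "400 mutations away in sequence" from SpCas9. One plausible way to read such a figure is as the number of differing columns in a pairwise alignment. The sketch below is a minimal illustration under that assumption; it is not the authors' code. The function names `count_mutations` and `percent_identity` and the toy sequences are hypothetical, and the inputs are assumed to be already aligned, with `-` marking gaps.

```python
def count_mutations(aligned_a: str, aligned_b: str) -> int:
    """Count alignment columns where two pre-aligned sequences differ.

    Substitutions and gap-versus-residue columns each count as one
    difference. This convention is an assumption for illustration.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(aligned_a, aligned_b) if a != b)


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical columns, ignoring columns that are gaps in both."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy, made-up aligned fragments (not real Cas9 sequences).
    natural = "MKRT-LV"
    generated = "MKKTALV"
    print(count_mutations(natural, generated))   # 2 differing columns
    print(round(percent_identity(natural, generated), 1))
```

In practice the alignment itself would come from a dedicated tool; this sketch only covers the counting step once an alignment exists.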