GRATCR: epitope-specific T cell receptor sequence generation with data-efficient pre-trained models

Zhenghong Zhou, Junwei Chen, Shenggeng Lin, Liang Hong, Dong-Qing Wei, Yi Xiong
DOI: https://doi.org/10.1101/2024.07.21.604503
2024-07-23
Abstract: T cell receptors (TCRs) play a crucial role in numerous immunotherapies targeting tumor cells. However, their acquisition and optimization present significant challenges, requiring laborious and time-consuming wet-lab experiments. Deep generative models have demonstrated remarkable capabilities in functional protein sequence generation, offering a promising route to acquiring specific TCR sequences. Here, we propose GRATCR, a framework that incorporates two pre-trained modules through a novel "grafting" strategy to generate TCR sequences de novo against specific epitopes. Experimental results demonstrate that TCRs generated by GRATCR exhibit higher specificity toward the desired epitopes and are more biologically functional than those of the state-of-the-art model, while using significantly less training data. Additionally, the generated sequences display novelty compared to natural sequences, and the interpretability evaluation further confirms that the model is capable of capturing important binding patterns. GRATCR is freely available at https://github.com/zhzhou23/GRATCR.
Bioinformatics
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **How can T-cell receptor (TCR) sequences specific to a particular epitope be generated efficiently?** Specifically, the authors propose a new framework named GRATCR, which aims to generate TCR sequences with high specificity and biological function de novo, using data-efficient pre-trained models.

### Problem Background

T-cell receptors (TCRs) play a crucial role in immunotherapy, especially in tumor treatment. However, obtaining and optimizing TCR sequences is challenging: traditional wet-lab methods are both time-consuming and costly. In recent years, deep generative models have performed well at generating functional protein sequences, offering a new route to acquiring specific TCR sequences.

### Main Problems

1. **Efficient Generation of TCR Sequences**: Existing methods usually require a large amount of training data, and the TCR sequences they generate may lack sufficient specificity or functionality.
2. **Data Efficiency**: High-quality TCR sequences must be generated from limited training data.
3. **Biological Functionality and Novelty**: The generated TCR sequences should be biologically functional and also differ from natural sequences, so that a larger TCR space can be explored.

### Solutions of GRATCR

GRATCR addresses these problems in the following ways:

- **New "Grafting" Strategy**: Two pre-trained modules (Epitope-BERT and TCR-GPT) are connected through a new strategy called "grafting", which is intended to preserve both the efficiency and the generation quality of the model.
- **Data Efficiency**: GRATCR uses only 1.5 million epitopes and 3 million TCRs for pre-training, significantly reducing the amount of data required.
- **Higher Specificity and Functionality**: Experimental results show that the TCR sequences generated by GRATCR are more specific than those of existing models and have better biological function.
- **Novelty**: The generated TCR sequences are novel compared to natural sequences and can explore a larger sequence space while maintaining biological relevance.

### Experimental Verification

To verify the effectiveness of GRATCR, the authors used multiple classification models (such as ERGO, ATMTCR, and TEPCAM) to evaluate whether the generated TCR sequences can specifically bind to the target epitope. The results show that the TCR sequences generated by GRATCR have an approximately 20% higher binding probability than those of existing models and also perform better in terms of biological conservation.

In conclusion, the main purpose of this paper is to address data efficiency, specificity, functionality, and novelty in TCR sequence generation by proposing the GRATCR framework, thereby providing a more effective tool for immunotherapy.
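
The novelty claim above (generated sequences differ from natural ones) can be made concrete with a small sketch. The summary does not say how the authors measured novelty, so the metric here is an assumption: for each generated sequence, take the minimum Levenshtein edit distance to any sequence in a natural reference set, and count a sequence as novel when that distance is greater than zero. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two amino-acid strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def novelty_report(generated: Iterable[str], natural: Iterable[str]) -> Dict[str, int]:
    """Map each generated sequence to its distance from the nearest natural one.

    Assumption: "novelty" is approximated by nearest-neighbour edit distance;
    the paper may use a different definition.
    """
    reference = set(natural)
    return {
        seq: min(edit_distance(seq, ref) for ref in reference)
        for seq in generated
    }


def novel_fraction(generated: Iterable[str], natural: Iterable[str]) -> float:
    """Fraction of generated sequences that do not appear in the natural set."""
    report = novelty_report(generated, natural)
    return sum(d > 0 for d in report.values()) / len(report)


if __name__ == "__main__":
    # Toy sequences made up for illustration; not data from the paper.
    natural = ["CASSLGQAYEQYF", "CASSIRSSYEQYF"]
    generated = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGTGYEQYF"]
    for seq, dist in novelty_report(generated, natural).items():
        print(seq, dist)
    print("novel fraction:", novel_fraction(generated, natural))
```

A nearest-neighbour distance is used rather than a plain membership test because it separates near-copies of natural sequences (distance 1) from sequences that lie further out in sequence space, which is closer to the idea of exploring a larger TCR space.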