Abstract: Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand challenge in immunology. Despite advances in high-throughput assays, specificity-labelled TCR data remains sparse. In other domains, the pre-training of language models on unlabelled data has been successfully used to address data bottlenecks. However, it is unclear how to best pre-train protein language models for TCR specificity prediction. Here we introduce a TCR language model called SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), capable of data-efficient transfer learning. Through our model, we introduce a novel pre-training strategy combining autocontrastive learning and masked-language modelling, which enables SCEPTR to achieve its state-of-the-art performance. In contrast, existing protein language models and a variant of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods. We anticipate that contrastive learning will be a useful paradigm to decode the rules of TCR specificity.

What problem does this paper attempt to address?

The problem this paper addresses is: **how to use contrastive learning to improve the accuracy of T cell receptor (TCR) specificity prediction, especially when labeled data are scarce**. Specifically, the researchers focus on a major challenge in immunology: computationally predicting the interaction between TCRs and their ligands. Although high-throughput experimental techniques have advanced, TCR data labeled with specificity remain very limited. In other fields, language models pre-trained on unlabeled data have been used successfully to overcome such data bottlenecks. However, it is not clear how best to pre-train protein language models for TCR specificity prediction.

To address this, the authors introduce a TCR language model named SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), which is capable of data-efficient transfer learning. By combining autocontrastive learning with masked-language modeling during pre-training, SCEPTR achieves state-of-the-art performance. In contrast, existing protein language models and a SCEPTR variant pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods.

### Main contributions of the paper

1. **Proposed a new pre-training strategy**: Combining autocontrastive learning and masked-language modeling makes SCEPTR perform well on TCR specificity prediction tasks.
2. **Verified the limitations of existing models**: Benchmark tests show that existing protein language models are inferior to sequence alignment-based methods in few-shot settings.
3. **Showed the advantages of SCEPTR**: SCEPTR not only outperforms or is on a par with TCRdist in multiple benchmarks, but also performs better on single-chain TCR data.
4. **Explored the mechanism of action of contrastive learning**: An information-theoretic analysis reveals how SCEPTR embedding distances weight sequence similarity according to VDJ recombination bias.

### Formula presentation

The contrastive loss function is defined as:

\[ L_{\text{contrastive}}(f)=\mathbb{E}_{\substack{(x, x^{+})\sim p_{\text{pos}} \\ \{y_{i}\}_{i=1}^{N}\,\overset{\text{i.i.d.}}{\sim}\,p_{\text{data}}}}\left[-\log\frac{\exp(f(x)^{\top}f(x^{+}))}{\exp(f(x)^{\top}f(x^{+}))+\sum_{i = 1}^{N}\exp(f(x)^{\top}f(y_{i}))}\right] \]

where \(f:X\rightarrow S^{m - 1}\) is a trainable embedding that maps observations from the sample space \(X\) to the unit hypersphere \(S^{m - 1}\subset\mathbb{R}^{m}\), \(p_{\text{pos}}\) is the joint distribution of positive sample pairs, \(p_{\text{data}}\) is the overall data distribution from which the background samples \(y_{i}\) are drawn, and \(N\in\mathbb{Z}^{+}\) is the fixed number of background samples.

### Summary

This paper addresses the scarcity of specificity-labeled data in TCR specificity prediction by introducing the SCEPTR model and its pre-training strategy, which combines autocontrastive learning with masked-language modeling and yields state-of-the-art performance.
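
As an illustration of the contrastive loss function given in the formula section above, the following is a minimal NumPy sketch that evaluates the loss for a batch of already-computed embeddings. It is not SCEPTR's implementation: the function names `l2_normalize` and `contrastive_loss` are hypothetical, the embeddings stand in for the outputs of \(f\), and how positive pairs and background samples are produced is assumed to happen elsewhere.

```python
import numpy as np


def l2_normalize(v, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere (hypothetical helper)."""
    norm = np.linalg.norm(v, axis=axis, keepdims=True)
    return v / np.maximum(norm, eps)


def contrastive_loss(anchor, positive, negatives):
    """Batch estimate of the contrastive loss from the formula above.

    anchor:    (B, m) embeddings f(x)
    positive:  (B, m) embeddings f(x+)
    negatives: (B, N, m) embeddings f(y_1), ..., f(y_N) per anchor
    All inputs are assumed to be unit-normalised already.
    """
    # Inner product f(x)^T f(x+) for each anchor: shape (B, 1)
    pos_logit = np.sum(anchor * positive, axis=-1, keepdims=True)
    # Inner products f(x)^T f(y_i) for each background sample: shape (B, N)
    neg_logits = np.einsum("bm,bnm->bn", anchor, negatives)
    # The denominator sums over the positive and all N background samples
    logits = np.concatenate([pos_logit, neg_logits], axis=1)
    # Numerically stable log-sum-exp of the denominator
    max_logit = logits.max(axis=1, keepdims=True)
    log_denom = max_logit[:, 0] + np.log(np.exp(logits - max_logit).sum(axis=1))
    # -log(exp(pos) / denom) = log(denom) - pos, averaged over the batch
    return float(np.mean(log_denom - pos_logit[:, 0]))


# Example with random stand-in embeddings (B=4 anchors, N=8 background, m=16)
rng = np.random.default_rng(0)
anchor = l2_normalize(rng.normal(size=(4, 16)))
positive = l2_normalize(anchor + 0.1 * rng.normal(size=(4, 16)))
negatives = l2_normalize(rng.normal(size=(4, 8, 16)))
example_loss = contrastive_loss(anchor, positive, negatives)
```

The loss falls as each anchor moves closer to its positive partner and further from the background samples. If every embedding were identical, the loss would equal \(\log(N+1)\).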

Contrastive learning of T cell receptor representations

Attention-aware contrastive learning for predicting T cell receptor-antigen binding specificity

Sequence-based TCR-Peptide Representations Using Cross-Epitope Contrastive Fine-tuning of Protein Language Models

Epitope-anchored contrastive transfer learning for paired CD8+ T cell receptor-antigen recognition

tcrLM: a lightweight protein language model for predicting T cell receptor and epitope binding specificity

T cell receptor binding prediction: A machine learning revolution

Enhancing TCR specificity predictions by combined pan- and peptide-specific training, loss-scaling, and sequence similarity integration

NetTCR-2.1: Lessons and guidance on how to develop models for TCR specificity predictions

TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences

Predicting Antigen Specificity of Single T Cells Based on TCR CDR3 Regions

TCR clustering by contrastive learning on antigen specificity

T-Cell Receptor Cognate Target Prediction Based on Paired α and β Chain Sequence and Structural CDR Loop Similarities

MATE-Pred: Multimodal Attention-based TCR-Epitope interaction Predictor

Context-Aware Amino Acid Embedding Advances Analysis of TCR-Epitope Interactions

Active Learning Framework for Cost-Effective TCR-Epitope Binding Affinity Prediction

TULIP — a Transformer based Unsupervised Language model for Interacting Peptides and T-cell receptors that generalizes to unseen epitopes

Predicting TCR sequences for unseen antigen epitopes using structural and sequence features

TCRpred: incorporating T-cell receptor repertoire for clinical outcome prediction

Self-supervised learning of T cell receptor sequences exposes core properties for T cell membership

TCR-GPT: Integrating Autoregressive Model and Reinforcement Learning for T-Cell Receptor Repertoires Generation