Contrastive learning of T cell receptor representations

Yuta Nagano, Andrew Pyo, Martina Milighetti, James Henderson, John Shawe-Taylor, Benny Chain, Andreas Tiffeau-Mayer
2024-10-10
Abstract: Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand challenge in immunology. Despite advances in high-throughput assays, specificity-labelled TCR data remains sparse. In other domains, the pre-training of language models on unlabelled data has been successfully used to address data bottlenecks. However, it is unclear how to best pre-train protein language models for TCR specificity prediction. Here we introduce a TCR language model called SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), capable of data-efficient transfer learning. Through our model, we introduce a novel pre-training strategy combining autocontrastive learning and masked-language modelling, which enables SCEPTR to achieve its state-of-the-art performance. In contrast, existing protein language models and a variant of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods. We anticipate that contrastive learning will be a useful paradigm to decode the rules of TCR specificity.
Biomolecules, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper asks **how contrastive learning can improve the accuracy of T cell receptor (TCR) specificity prediction, especially when labelled data is scarce**.

Computationally predicting the interaction between TCRs and their ligands is a major challenge in immunology. Although high-throughput assays have advanced, specificity-labelled TCR data remains very limited. In other fields, language models pre-trained on unlabelled data have been used successfully to overcome such data bottlenecks, but it is not clear how best to pre-train protein language models for TCR specificity prediction.

To address this, the authors introduce a TCR language model named SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), which is capable of data-efficient transfer learning. By combining autocontrastive learning with masked-language modelling during pre-training, SCEPTR achieves state-of-the-art performance. In contrast, existing protein language models and a variant of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods.

### Main contributions of the paper

1. **Proposed a new pre-training strategy**: Combining autocontrastive learning and masked-language modelling makes SCEPTR perform well on TCR specificity prediction tasks.
2. **Verified the limitations of existing models**: Benchmark tests show that existing protein language models are inferior to sequence alignment-based methods in few-shot settings.
3. **Showed the advantages of SCEPTR**: SCEPTR outperforms or is on a par with TCRdist in multiple benchmarks, and also performs better on single-chain TCR data.
4. **Explored the mechanism of action of contrastive learning**: An information-theoretic analysis reveals how SCEPTR embedding distances weight sequence similarity according to VDJ recombination bias.

### Key formula

The contrastive loss function is defined as:

\[
L_{\text{contrastive}}(f)=\mathbb{E}_{(x, x^{+})\sim p_{\text{pos}},\ \{y_{i}\}_{i=1}^{N}\overset{\text{i.i.d.}}{\sim} p_{\text{data}}}\left[-\log\frac{\exp(f(x)^{\top}f(x^{+}))}{\exp(f(x)^{\top}f(x^{+}))+\sum_{i = 1}^{N}\exp(f(x)^{\top}f(y_{i}))}\right]
\]

where \(f:X\rightarrow S^{m - 1}\) is a trainable embedding that maps observations from the sample space \(X\) to the unit hypersphere \(S^{m - 1}\subset\mathbb{R}^{m}\), \(p_{\text{pos}}\) is the joint distribution of positive sample pairs, \(p_{\text{data}}\) is the overall data distribution from which the background samples \(y_{i}\) are drawn, and \(N\in\mathbb{Z}^{+}\) is a fixed number of background samples.

### Summary

This paper addresses the scarcity of labelled data in TCR specificity prediction. By introducing the SCEPTR model and its pre-training strategy, which combines autocontrastive learning with masked-language modelling, it improves prediction accuracy in data-scarce settings.
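
### Illustrative sketch of the contrastive loss

To make the contrastive loss defined above more concrete, the following is a minimal sketch in plain Python that evaluates the term inside the expectation for a single positive pair and a handful of background samples. It is not the authors' implementation: the helper names (`normalize`, `dot`, `contrastive_loss`) and the two-dimensional toy vectors are assumptions made purely for illustration, and the embeddings are hand-written vectors rather than outputs of a trained model.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere (L2 normalisation)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(u, v):
    """Inner product f(x)^T f(y) of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(anchor, positive, negatives):
    """Loss term for one positive pair and N background samples.

    All inputs are assumed to be unit-norm embeddings. The full loss
    is the expectation of this quantity over sampled pairs, which in
    practice would be approximated by averaging over a batch.
    """
    pos = math.exp(dot(anchor, positive))
    neg = sum(math.exp(dot(anchor, y)) for y in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical values, for illustration only).
x = normalize([1.0, 0.0])
background = [normalize([0.0, 1.0]), normalize([-1.0, 0.0])]

# A positive partner embedded close to the anchor gives a small loss...
loss_close = contrastive_loss(x, normalize([0.9, 0.1]), background)
# ...while one embedded far from the anchor gives a larger loss.
loss_far = contrastive_loss(x, normalize([-1.0, 0.1]), background)

print(loss_close < loss_far)
```

Minimising this quantity pulls positive pairs together on the hypersphere while pushing each anchor away from the background samples, which is the behaviour the formula encodes.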