Abstract: Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC binding within their training data, they often fail to generalize to peptides outside their training distributions, raising concerns about their applicability in therapeutic settings. Understanding and improving the generalization of these models is therefore critical for real-world applications. To address this issue, we evaluate how the distance between training and testing peptide distributions affects empirical risk assessments of ML models, using sequence-based and 3D structure-based distance metrics. Our analysis covers several state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB), NetTCR-2.0 and NetTCR-2.2, and two ERGO II variants (pre-trained autoencoder and LSTM). We introduce a novel approach for assessing the generalization capabilities of TCR binding predictors: the Distance Split (DS) algorithm. The DS algorithm controls the distance between training and testing peptides based on both sequence and structure, allowing for a more nuanced evaluation of model performance. We show that lower 3D shape similarity between training and test peptides is associated with a harder out-of-distribution task, which makes for a more informative measure of the ability to generalize to unseen peptides. However, we observe the opposite effect when splitting by sequence-based similarity. These findings highlight the importance of using a distance-based splitting approach to benchmark models. Such an approach could then be used to estimate a confidence score for predictions on novel, unseen peptides, based on how different they are from the training peptides. Additionally, our results hint that using 3D shape to complement sequence information could improve the accuracy of TCR-pMHC binding predictors.
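
To make the idea of a distance-controlled split concrete, the following is a minimal sketch and not the DS algorithm itself, whose details the abstract does not give. It assumes Levenshtein edit distance as a stand-in sequence metric and holds out the peptides with the highest mean distance to all others. The function names `edit_distance` and `distance_split`, the `farthest` flag, and the example peptides are all hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_split(peptides: List[str], n_test: int,
                   farthest: bool = True) -> Tuple[List[str], List[str]]:
    """Hypothetical distance-based split (an assumption, not the paper's method).

    Each peptide is scored by its mean edit distance to every other peptide.
    The n_test highest-scoring (farthest=True) or lowest-scoring
    (farthest=False) peptides form the test set; the rest form the training set.
    """
    unique = sorted(set(peptides))
    if not 0 < n_test < len(unique):
        raise ValueError("n_test must be between 1 and len(unique peptides) - 1")
    scores = {
        p: mean(edit_distance(p, q) for q in unique if q != p)
        for p in unique
    }
    # Sort by score, then by sequence, so that ties are broken deterministically.
    ranked = sorted(unique, key=lambda p: (scores[p], p), reverse=farthest)
    test = ranked[:n_test]
    train = [p for p in unique if p not in test]
    return train, test


if __name__ == "__main__":
    example = ["GILGFVFTL", "GILGFVFTV", "GLLGFVFTL", "NLVPMVATV", "KKKKKKKKK"]
    train, test = distance_split(example, n_test=1)
    print("train:", train)
    print("test:", test)
```

Setting `farthest=False` yields the opposite regime, in which the test peptides are close to the training ones. A structure-based variant would replace `edit_distance` with a 3D shape distance, which is not sketched here.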


Assessing the Generalization Capabilities of TCR Binding Predictors via Peptide Distance Analysis
