Revealing the hidden sequence distribution of epitope-specific TCR repertoires and its influence on machine learning model performance

Sofie Gielis, Maria Chernigovskaya, Milena Pavlovic, Vincent Van Deuren, Romi Vandoren, Sebastiaan Valkiers, Kris Laukens, Victor Greiff, Pieter Meysman
DOI: https://doi.org/10.1101/2024.10.21.619364
2024-10-24
Abstract: Numerous efforts have been made to decipher the epitope-T cell receptor (TCR) recognition code. Both simple machine learning techniques and deep learning strategies have been used to train models to predict the binding of epitopes by TCR sequences. A good training data set rests at the basis of every accurate prediction model, yet little attention has been given to the composition of these data sets. In this paper, we studied the natural distribution of TCR sequences within epitope-specific TCR repertoires, i.e. a set of TCRs binding the same epitope, and its impact on the predictability of TCR-epitope interactions. We found that the observed diversity of these repertoires can result from a smaller set of core binding motifs constrained by TCR generation. Moreover, a clear relationship was found between the sequence distribution of the training data and performance metrics, emphasizing the importance of the used ground-truth data when using machine learning models in this domain. Taken together, these findings inform data set composition to help push epitope-TCR prediction models to the next level.
Bioinformatics