Restoring data balance via generative models of T-cell receptors for antigen-binding prediction

Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson
DOI: https://doi.org/10.1101/2024.07.10.602897
2024-07-15
Abstract: Unveiling the specificity of T-cell receptor and antigen recognition is a major step toward understanding the immune system response. Many supervised machine learning approaches have been designed to build sequence-based predictive models of this specificity from binding and non-binding examples. Because each antigen has few specific and many non-specific T-cell receptors, available datasets are heavily imbalanced, which makes solid predictive performance very challenging to achieve. Here, we propose to restore data balance through data augmentation using generative unsupervised models. We then use these augmented data to train supervised models for the prediction of peptide-specific T-cell receptors and of binding pairs of peptide and T-cell receptor sequences. We show that our pipeline improves performance on T-cell receptor specificity prediction tasks. More broadly, our work provides a general framework to restore balance in computational problems involving biological sequence data.
Bioinformatics
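
The balance-restoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which generative model the authors use, so the independent-site frequency model below is a simple stand-in chosen for illustration, not the paper's method. The function names, the pseudocount, and the toy sequences are all hypothetical, and the sketch assumes sequences of equal length.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_site_model(seqs, pseudocount=1.0):
    """Estimate per-position amino-acid frequencies (assumption: an
    independent-site model stands in for the paper's generative model)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    model = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        # The pseudocount keeps unseen residues at a small nonzero probability.
        weights = [counts.get(a, 0) + pseudocount for a in AMINO_ACIDS]
        total = sum(weights)
        model.append([w / total for w in weights])
    return model


def sample_sequences(model, n, rng):
    """Draw n synthetic sequences, sampling each position independently."""
    return [
        "".join(rng.choices(AMINO_ACIDS, weights=site)[0] for site in model)
        for _ in range(n)
    ]


def restore_balance(binders, non_binders, rng):
    """Pad the minority (binder) class with generated sequences until it
    matches the size of the majority (non-binder) class."""
    deficit = len(non_binders) - len(binders)
    if deficit <= 0:
        return list(binders)
    model = fit_site_model(binders)
    return list(binders) + sample_sequences(model, deficit, rng)


rng = random.Random(0)
# Toy, made-up sequences used only to exercise the code.
binders = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGQAYEQYF"]
non_binders = [
    "".join(rng.choice(AMINO_ACIDS) for _ in range(13)) for _ in range(20)
]
augmented = restore_balance(binders, non_binders, rng)
```

After this step, `augmented` and `non_binders` have the same size and could be passed as the positive and negative classes to any supervised classifier. A richer generative model would capture correlations between positions that this stand-in ignores.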