Evaluating the utility of amino acid similarity-aware kmers to represent TCR repertoires for classification

Hannah Kockelbergh, Shelley C. Evans, Liam Brierley, Peter L. Green, Andrea L. Jorgensen, Elizabeth J. Soilleux, Anna Fowler
DOI: https://doi.org/10.1101/2024.12.06.626025
2024-12-10
Abstract: Insights gained through interpretation of models trained on the T-cell receptor (TCR) repertoire to infer the presence of immune-mediated conditions could contribute to advances in the understanding of disease. This may lead to improved diagnostic tests and treatments for immune-mediated conditions, particularly autoimmune diseases. However, TCR repertoire datasets with known autoimmune disease status labels generally include orders of magnitude fewer samples than TCR sequences. Promising TCR repertoire classification approaches consider relationships between non-identical TCR sequences. In particular, kmer methods demonstrate strong and stable performance for small datasets. We propose a TCR repertoire representation that considers the relationships between amino acids within kmers in a flexible and efficient manner, and we evaluate it in comparison to existing methods. XGBoost models are trained and tested on kmer representations of TCR repertoire datasets including samples from patients with coeliac disease as well as participants with previous cytomegalovirus infection. We show that kmers that use small representative alphabets of amino acids can be used to train models that perform similarly to or better than models trained on kmers based on all 20 amino acids. We find that, for cytomegalovirus infection status classification, defining amino acid relationships using BLOSUM62 can lead to a model with stronger performance than one using an Atchley factor definition. Finally, we detail kmers or motifs that are important in each classification model and highlight the challenge of training truly interpretable TCR repertoire classification models which, if overcome, could lead to biomarker discovery for autoimmune diseases.
Immunology