Genesis: A Modular Protein Language Modelling Approach to Immunogenicity Prediction

Hugh O'Brien, Max Salm, Laura T Morton, Maciej Szukszto, Felix O'Farrell, Charlotte Boulton, Laurence King, Supreet Kaur Bola, Pablo Becker, Andrew Craig, Morten Nielsen, Yardena Samuels, Charles Swanton, Marc R Mansour, Sine Reker Hadrup, Sergio Quezada
DOI: https://doi.org/10.1101/2024.05.22.595296
2024-05-26
Abstract: Neoantigen immunogenicity prediction is a highly challenging problem in the development of personalised medicines. Low reactivity rates in called neoantigens result in a difficult prediction scenario with limited training datasets. Here we describe Genesis, a modular protein language modelling approach to immunogenicity prediction for CD8+ reactive epitopes. Genesis comprises a pMHC encoding module trained on three pMHC prediction tasks, an optional TCR encoding module and a set of context-specific immunogenicity prediction head modules. Compared with state-of-the-art models for each task, Genesis' encoding module performs comparably or better on pMHC binding affinity, eluted ligand prediction and stability tasks. Genesis outperforms all compared models on pMHC immunogenicity prediction (area under the receiver operating characteristic curve: 0.619, average precision: 0.514), with a 7% increase in average precision compared to the next best model. Genesis shows further improved performance on immunogenicity prediction with the integration of TCR context information. Genesis performance is further analysed for interpretability, which locates areas of weakness found across existing immunogenicity models and highlights possible biases in public datasets.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the prediction of neoantigen immunogenicity for personalised cancer immunotherapy. The task is difficult because only a small fraction of called neoantigens turn out to be reactive, which leaves little data to train on. The researchers propose Genesis, a modular protein language modelling approach for predicting the immunogenicity of CD8+ T cell reactive epitopes. Genesis consists of three components: a pMHC encoding module, an optional TCR encoding module, and a set of context-specific immunogenicity prediction heads.

The pMHC encoding module is trained on three pMHC prediction tasks and performs comparably to or better than state-of-the-art models on each of them: binding affinity, eluted ligand prediction, and stability. By using the Transformer architecture, Genesis is able to incrementally learn these tasks, providing a high-quality encoding for downstream immunogenicity prediction. On pMHC immunogenicity prediction, Genesis outperforms all compared models (area under the receiver operating characteristic curve 0.619, average precision 0.514), a 7% increase in average precision over the next best model. In independent cancer-specific datasets it reaches state-of-the-art performance, and it can be extended with further features as more data becomes available.

The paper also discusses TCR-specific prediction. Existing models have limited performance on unseen epitopes, but integrating TCR context information improves immunogenicity prediction: when patient TCR information is provided, Genesis's performance increases further. An interpretability analysis of Genesis locates areas of weakness shared across existing immunogenicity models and highlights possible biases in public datasets. In conclusion, Genesis aims to improve neoantigen immunogenicity prediction through modular, incremental learning, with potential application in personalised cancer therapy.
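The modular layout described above (a shared pMHC encoder, an optional TCR encoder, and one prediction head per context) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `toy_encode`, `Head` and `ModularModel`, the context keys `"pmhc_only"` and `"pmhc_tcr"`, and the input strings are hypothetical, and the amino-acid composition encoder is only a stand-in for a trained Transformer module.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Vector = List[float]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_encode(sequence: str) -> Vector:
    """Placeholder encoder: amino-acid composition of the sequence.

    Assumption for illustration only; a real module would be a trained
    protein language model.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def toy_pmhc_encoder(peptide: str, mhc: str) -> Vector:
    """Encode a peptide-MHC pair by concatenating both encodings (40 values)."""
    return toy_encode(peptide) + toy_encode(mhc)


@dataclass
class Head:
    """Hypothetical logistic prediction head: sigmoid(w . x + b)."""
    weights: Vector
    bias: float = 0.0

    def __call__(self, features: Vector) -> float:
        if len(features) != len(self.weights):
            raise ValueError("feature length does not match head weights")
        z = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class ModularModel:
    """Shared pMHC encoder, optional TCR encoder, one head per context."""
    pmhc_encoder: Callable[[str, str], Vector]
    heads: Dict[str, Head]
    tcr_encoder: Optional[Callable[[str], Vector]] = None

    def predict(self, peptide: str, mhc: str, context: str,
                tcr: Optional[str] = None) -> float:
        features = self.pmhc_encoder(peptide, mhc)
        if tcr is not None:
            if self.tcr_encoder is None:
                raise ValueError("TCR given but no TCR encoder is configured")
            # TCR context is appended to the pMHC encoding.
            features = features + self.tcr_encoder(tcr)
        return self.heads[context](features)


# Untrained heads (all-zero weights), so every score is exactly 0.5.
model = ModularModel(
    pmhc_encoder=toy_pmhc_encoder,
    tcr_encoder=toy_encode,
    heads={
        "pmhc_only": Head(weights=[0.0] * 40),
        "pmhc_tcr": Head(weights=[0.0] * 60),
    },
)

# Illustrative input strings, not data from the paper.
score_pmhc = model.predict("GILGFVFTL", "YFAMYGEKVAHTHVDTLYVRYHYYTWAVLAYTWY",
                           context="pmhc_only")
score_tcr = model.predict("GILGFVFTL", "YFAMYGEKVAHTHVDTLYVRYHYYTWAVLAYTWY",
                          context="pmhc_tcr", tcr="CASSIRSSYEQYF")
```

In this arrangement the encoder can be reused unchanged while heads are added or swapped per prediction context, which is the property the summary attributes to the modular design. How Genesis actually combines pMHC and TCR encodings is not stated here, so the concatenation step is an assumption.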