De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments
Kevin Eloff, Konstantinos Kalogeropoulos, Oliver Morell, Amandla Mabona, Jakob Berg Jespersen, Wesley Williams, Sam P. B. van Beljouw, Marcin Skwark, Andreas Hougaard Laustsen, Stan J. J. Brouns, Anne Ljungers, Erwin M. Schoof, Jeroen Van Goey, Ulrich auf dem Keller, Karim Beguir, Nicolas Lopez Carranza, Timothy P. Jenkins
DOI: https://doi.org/10.1101/2023.08.30.555055
2024-03-04
Abstract: Bottom-up mass spectrometry-based proteomics is challenged by the task of identifying the peptide that generates a tandem mass spectrum. Traditional methods that rely on known peptide sequence databases are limited and may not be applicable in certain contexts. De novo peptide sequencing, which assigns peptide sequences to the spectra without prior information, is valuable for various biological applications; yet, due to a lack of accuracy, it remains challenging to apply this approach in many situations. Here, we introduce InstaNovo, a transformer neural network with the ability to translate fragment ion peaks into the sequence of amino acids that make up the studied peptide(s). The model was trained on 28 million labelled spectra matched to 742k human peptides from the ProteomeTools project. We demonstrate that InstaNovo outperforms current state-of-the-art methods on benchmark datasets and showcase its utility in several applications. Building upon human intuition, we also introduce InstaNovo+, a multinomial diffusion model that further improves performance by iterative refinement of predicted sequences. Using these models, we could sequence antibody-based therapeutics with unprecedented coverage, discover novel peptides, and detect unreported organisms in different datasets, thereby expanding the scope and detection rate of proteomics searches. Finally, we could experimentally validate tryptic and non-tryptic peptides with targeted proteomics, demonstrating the fidelity of our predictions. Our models unlock a plethora of opportunities across different scientific domains, such as direct protein sequencing, immunopeptidomics, and exploration of the dark proteome.
Bioinformatics