Unipept in 2024: Expanding metaproteomics analysis with support for missed cleavages, semi-tryptic and non-tryptic peptides

Tibo Vande Moortele, Bram Devlaminck, Simon Van de Vyver, Tim Van Den Bossche, Lennart Martens, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt
DOI: https://doi.org/10.1101/2024.09.26.615136
2024-11-27
Abstract: Unipept, a pioneering software tool in metaproteomics, has significantly advanced the analysis of complex ecosystems by facilitating both taxonomic and functional insights from environmental samples. From the onset, Unipept's capabilities focused on tryptic peptides, utilizing the predictability and consistency of trypsin digestion to efficiently construct a protein reference database. However, the evolving landscape of proteomics and emerging fields like immunopeptidomics necessitate a more versatile approach that extends beyond the analysis of tryptic peptides. In this article, we present a significant update to the underlying index structure of Unipept, which is now powered by a Sparse Suffix Array index. This advancement enables the analysis of semi-tryptic peptides, peptides with missed cleavages, and non-tryptic peptides such as those encountered in other research fields such as immunopeptidomics (e.g. MHC- and HLA-peptides). This new index benefits all tools in the Unipept ecosystem such as the web application, desktop tool, API and command line interface. A benchmark study highlights significantly improved performance in handling missed cleavages, preserving the same level of accuracy.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses four main problems:

1. **Handling missed cleavages**:
   - By default, the previous version of Unipept assumes that peptides contain no missed cleavages. In practice, enzymatic digestion is often incomplete, so peptides with missed cleavages are common. Such peptides cannot be matched correctly to proteins, which reduces the accuracy of the analysis.
   - The paper proposes a new index structure, based on a Sparse Suffix Array (SSA), that supports missed cleavages with significantly better performance while preserving the same level of accuracy.
2. **Supporting semi-tryptic and non-tryptic peptides**:
   - Traditionally, Unipept mainly processed tryptic peptides, that is, peptides produced by trypsin cleavage after lysine or arginine residues. With the rise of fields such as immunopeptidomics, more peptide types need to be processed, including semi-tryptic and non-tryptic peptides.
   - The new index structure enables Unipept to process these more complex peptide types, expanding its range of applications.
3. **Improving performance**:
   - The previous version of Unipept slows down considerably when handling missed cleavages. For example, for a data set of 24,424 peptides, enabling missed cleavage handling increases the analysis time from 11 seconds to 6 minutes and 33 seconds.
   - By introducing the sparse suffix array and an optimized API implementation, the new version of Unipept handles missed cleavages 20 to 70 times faster, significantly shortening the analysis time.
4. **Memory optimization**:
   - The previous version of Unipept depends on precomputed databases, which require a large amount of memory. For example, processing 130 million tryptic peptides requires at least 60 GB of memory.
   - The new index structure significantly reduces memory usage through sparsification, compression, and bit-packing. For example, the original dense suffix array requires 696 GB of memory, while the optimized sparse suffix array requires only 133 GB.

In summary, the main objective of this paper is to improve Unipept's performance when handling missed cleavages and complex peptide types, while preserving accuracy and reducing memory usage, by introducing a new index structure and optimization techniques. This enables Unipept to better meet the research requirements of modern proteomics and immunopeptidomics.
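To make the idea of a sparse suffix array more concrete, the following is a minimal Python sketch of how such an index can answer arbitrary substring queries, including peptides with missed cleavages and non-tryptic peptides. It is an illustration under stated assumptions, not Unipept's implementation: the function names, the toy protein text, the `-` separator, and the sparseness factor `k = 3` are all hypothetical. Only every `k`-th suffix is stored, so a query is searched once for each of the `k` possible offsets and every candidate hit is verified against the text.

```python
def build_sparse_suffix_array(text, k):
    """Return the start positions 0, k, 2k, ... sorted by the suffix they begin.

    Naive construction for illustration only; it is far too slow for a real
    protein database.
    """
    return sorted(range(0, len(text), k), key=lambda i: text[i:])


def _lower_bound(text, ssa, tail):
    """Index of the first sampled suffix whose prefix is >= tail."""
    lo, hi = 0, len(ssa)
    m = len(tail)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[ssa[mid]:ssa[mid] + m] < tail:
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_all(text, ssa, k, pattern):
    """Return all start positions of pattern in text using the sparse index.

    Any occurrence overlaps exactly one sampled position within its first k
    characters, so we drop 0..k-1 leading characters from the pattern, look up
    the remaining tail, and verify the dropped characters against the text.
    """
    if len(pattern) < k:
        raise ValueError("pattern must be at least k characters long")
    hits = set()
    for skip in range(k):
        tail = pattern[skip:]
        idx = _lower_bound(text, ssa, tail)
        # All sampled suffixes starting with `tail` are contiguous in the array.
        while idx < len(ssa) and text.startswith(tail, ssa[idx]):
            start = ssa[idx] - skip
            if start >= 0 and text.startswith(pattern, start):
                hits.add(start)
            idx += 1
    return sorted(hits)


# Hypothetical toy "database": two made-up protein sequences joined by '-'.
TEXT = "MKTAYIAKQR-GLSDGEWQLVLNVWGK"
K = 3  # assumed sparseness factor, chosen only for this example
SSA = build_sparse_suffix_array(TEXT, K)

# "AYIAKQ" is not a tryptic peptide and contains an internal lysine (K),
# yet it can still be located because the index covers every substring.
print(find_all(TEXT, SSA, K, "AYIAKQ"))
```

The trade-off visible in the sketch is the one the summary describes: a larger `k` stores fewer suffixes and therefore uses less memory, at the cost of up to `k` lookups per query and a minimum query length of `k`.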