Abstract: Unipept, a pioneering software tool in metaproteomics, has significantly advanced the analysis of complex ecosystems by facilitating both taxonomic and functional insights from environmental samples. From the outset, Unipept's capabilities focused on tryptic peptides, utilizing the predictability and consistency of trypsin digestion to efficiently construct a protein reference database. However, the evolving landscape of proteomics and emerging fields like immunopeptidomics necessitate a more versatile approach that extends beyond the analysis of tryptic peptides. In this article, we present a significant update to the underlying index structure of Unipept, which is now powered by a Sparse Suffix Array index. This advancement enables the analysis of semi-tryptic peptides, peptides with missed cleavages, and non-tryptic peptides such as those encountered in immunopeptidomics (e.g. MHC- and HLA-peptides). The new index benefits all tools in the Unipept ecosystem, including the web application, desktop tool, API and command line interface. A benchmark study highlights significantly improved performance in handling missed cleavages while preserving the same level of accuracy.
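To make the terms "tryptic peptide" and "missed cleavage" concrete, here is a minimal in silico digestion sketch. It assumes the common textbook rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); the function names and the example sequence are hypothetical and are not taken from Unipept.

```python
def tryptic_fragments(protein: str) -> list[str]:
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing fragment that does not end in K or R.
        fragments.append(protein[start:])
    return fragments


def digest(protein: str, missed_cleavages: int = 0) -> list[str]:
    """Return all peptides spanning at most `missed_cleavages` skipped sites."""
    fragments = tryptic_fragments(protein)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            # Joining adjacent fragments models a cleavage site that was missed.
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


example = "MKAAGRPLKTESTR"  # hypothetical sequence
print(digest(example))
print(digest(example, missed_cleavages=1))
```

An index that stores only the fully cleaved fragments cannot match a peptide such as `MKAAGRPLK` directly, which is the limitation the new index addresses.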

What problem does this paper attempt to address?

The paper addresses four main problems:

1. **Handling missed cleavage sites**:
   - By default, the traditional Unipept tool assumes that peptides contain no missed cleavages. In practice, enzymatic digestion is often incomplete, so peptides with missed cleavage sites are common. Such peptides cannot be matched correctly to proteins, which reduces the accuracy of the analysis.
   - The paper proposes a new index structure based on a Sparse Suffix Array (SSA) that supports missed cleavage sites, improving both the accuracy and the performance of the analysis.
2. **Supporting semi-tryptic and non-tryptic peptides**:
   - Traditionally, Unipept mainly processes tryptic peptides, that is, peptides produced by trypsin, which cleaves proteins after lysine or arginine. With the rise of fields such as immunopeptidomics, more peptide types need to be processed, including semi-tryptic and non-tryptic peptides.
   - The new index structure enables Unipept to process these more complex peptide types, expanding its range of applications.
3. **Improving performance**:
   - The traditional Unipept slows down considerably when missed cleavage handling is enabled. For example, for a data set containing 24,424 peptides, the processing time increases from 11 seconds to 6 minutes and 33 seconds.
   - With the sparse suffix array and an optimized API implementation, the new version of Unipept is 20 to 70 times faster when handling missed cleavage sites, significantly shortening the analysis time.
4. **Memory optimization**:
   - The traditional Unipept depends on precomputed databases, which require a large amount of memory. For example, processing 130 million tryptic peptides requires at least 60 GB of memory.
   - The new index structure significantly reduces memory usage through sparsification, compression, and bit-packing. For example, the original dense suffix array requires 696 GB of memory, while the optimized sparse suffix array requires only 133 GB.

In summary, the main objective of this paper is to improve the performance and accuracy of Unipept when handling complex peptide types and missed cleavage sites, and to reduce memory usage, by introducing a new index structure and optimization techniques. This enables Unipept to better meet the requirements of modern proteomics and immunopeptidomics research.
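The idea behind a sparse suffix array can be illustrated with a small sketch. This is a simplified illustration of the general technique and not Unipept's implementation: only every `k`-th suffix of the concatenated protein text is indexed, and a query is searched `k` times, each time skipping a few leading residues and then verifying the skipped part against the text. The sparseness factor `k`, the `-` separator between proteins, the function names, and the example sequences are all assumptions made for this example.

```python
def build_sparse_suffix_array(text: str, k: int) -> list[int]:
    """Sort only the suffixes that start at positions divisible by k."""
    return sorted(range(0, len(text), k), key=lambda i: text[i:])


def sampled_matches(text: str, ssa: list[int], pattern: str) -> list[int]:
    """Binary-search the sampled suffixes that start with `pattern`."""
    lo, hi = 0, len(ssa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[ssa[mid]:ssa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    # All suffixes sharing the prefix are contiguous in sorted order.
    while lo < len(ssa) and text.startswith(pattern, ssa[lo]):
        matches.append(ssa[lo])
        lo += 1
    return matches


def search(text: str, ssa: list[int], k: int, peptide: str) -> list[int]:
    """Find all start positions of `peptide`; complete when len(peptide) >= k."""
    hits = set()
    for skip in range(k):
        tail = peptide[skip:]
        if not tail:
            break
        for pos in sampled_matches(text, ssa, tail):
            start = pos - skip
            # Verify the skipped leading residues directly in the text.
            if start >= 0 and text.startswith(peptide, start):
                hits.add(start)
    return sorted(hits)


# Two hypothetical proteins joined by a separator that never occurs in peptides.
text = "MKAAGRPLKTESTR-AAGRPLKW"
k = 3  # assumed sparseness factor for this example
ssa = build_sparse_suffix_array(text, k)
print(search(text, ssa, k, "AAGRPLK"))  # tryptic peptide
print(search(text, ssa, k, "GRPL"))     # non-tryptic substring
```

Because any substring of the text can be looked up, the same index serves tryptic, semi-tryptic, and non-tryptic queries as well as peptides with missed cleavages. The trade-off is that the index stores roughly `1/k` of the suffix positions at the cost of up to `k` binary searches per query.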

Unipept in 2024: Expanding metaproteomics analysis with support for missed cleavages, semi-tryptic and non-tryptic peptides
