Leveraging the Human Panproteome to Enhance Peptide and Protein Identification in Proteomics and Metaproteomics

Jamie Canderan, Ruoying Yuan, Haixu Tang, Yuzhen Ye
DOI: https://doi.org/10.1101/2024.11.25.625239
2024-11-26
Abstract: In this paper, we developed a novel approach that uses the human pangenome to improve peptide and protein identification from proteomic data (MS/MS spectra). We propose a new data structure, the panproteome graph (PPG), in which nodes are tryptic peptides, to represent the human pangenome. The PPG can be built in linear time and traversed with a depth-first search algorithm to generate candidate peptides for peptide identification in proteomics. The PPG built from the 47 human proteomes of the Human Pangenome Reference Consortium (HPRC) together with UniProt human proteins contained more than 4.2M tryptic peptides, a 26% increase over using the UniProt proteins alone. Graph-based analysis of the PPG revealed a giant connected component with about 3M nodes, suggesting substantial sharing of tryptic peptides among proteins. We applied tryptic peptides derived from the PPG to three collections of human proteomic and metaproteomic datasets. By exploiting the human pangenome, we increased the number of identified peptides on every dataset we tested (about an 8% increase across all three collections). We also showed that using a more complete human proteome helps reduce the potential misidentification of human peptides as microbial peptides, a problem that has previously been studied using genomic sequencing data. Our tool for building the PPG is available in the GitHub repository PPGpep, and PPG-derived tryptic peptides can be used by MetaProD, a pipeline for both human and bacterial peptide and protein identification from (meta)proteomics datasets.
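To make the idea of a graph of tryptic peptides concrete, here is a minimal Python sketch. It is not the PPGpep implementation. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), assumes that identical peptide sequences collapse into a single node, and uses hypothetical function names (`tryptic_digest`, `build_ppg`, `peptides_with_missed_cleavages`) chosen only for illustration. A depth-limited depth-first traversal concatenates adjacent nodes to produce peptides with missed cleavages.

```python
import re
from collections import defaultdict


def tryptic_digest(protein):
    """Split after K or R unless followed by P (simplified trypsin rule, an assumption)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def build_ppg(proteins):
    """Build a toy peptide graph from {name: sequence}.

    Nodes are peptide sequences (identical peptides are merged, an assumption);
    an edge a -> b means b directly follows a in at least one protein.
    One pass over each protein, so the work is linear in total sequence length.
    """
    edges = defaultdict(set)
    starts = set()
    for seq in proteins.values():
        peptides = tryptic_digest(seq)
        if not peptides:
            continue
        starts.add(peptides[0])
        edges[peptides[-1]]  # make sure the last peptide exists as a node
        for a, b in zip(peptides, peptides[1:]):
            edges[a].add(b)
    return edges, starts


def peptides_with_missed_cleavages(edges, max_missed=1):
    """Depth-limited DFS from every node, concatenating up to max_missed extra peptides."""
    found = set()
    for node in list(edges):
        stack = [(node, node, 0)]  # (current node, sequence so far, missed cleavages)
        while stack:
            current, sequence, missed = stack.pop()
            found.add(sequence)
            if missed < max_missed:
                for nxt in edges[current]:
                    stack.append((nxt, sequence + nxt, missed + 1))
    return found
```

For two toy sequences that differ by one residue, such as `{"ref": "AAKCCRDDK", "var": "AAKCWRDDK"}`, the shared peptides `AAK` and `DDK` become single nodes and the variant adds one extra node and two extra edges. One caveat of merging nodes by sequence in this sketch: with `max_missed` of 2 or more, a path can join peptides from different proteins and yield a sequence that occurs in none of them, so a real implementation would need to track which proteins support each path.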
Bioinformatics