Abstract:

MOTIVATION: Tandem mass spectrometry combined with sequence database searching is one of the most powerful tools for protein identification. Because a mass spectrometer generates thousands of spectra in one hour, the speed of database searching is critical, especially when searching against a large sequence database, when the peptide is generated by an unknown or non-specific enzyme, or when the target peptides carry post-translational modifications (PTMs). In practice, about 70-90% of the spectra have no match in the database. Many believe that a significant portion of these unmatched spectra come from peptides produced by non-specific digestion by unknown enzymes, or from amino acid modifications. In other cases, scientists deliberately use non-specific enzymes such as pepsin or thermolysin for proteolysis in proteomic studies, because not all proteins are amenable to digestion by site-specific enzymes, and because many of the resulting peptides may fall outside the range of molecular weight suitable for mass spectrometry analysis. Interpreting mass spectra of these kinds costs database search engines a great deal of computation time.

OVERVIEW: The present study was designed to speed up the database searching process for both cases. Specifically, we employ an approach that combines a suffix tree data structure with a spectrum graph. The suffix tree is used to preprocess the protein sequence database, while the spectrum graph is used to preprocess the tandem mass spectrum. We then search the suffix tree against the spectrum graph for candidate peptides. We design an efficient algorithm to compute, for each spectrum, a matching threshold at a chosen statistical significance level, e.g. p = 0.01, and use it to select candidate peptides. We then rank these peptides using a SEQUEST-like scoring function. The algorithms were implemented and tested on experimental data. For post-translational modifications, we allow an arbitrary number of any modification to a protein.

AVAILABILITY: The executable program and other supplementary materials are available online at: http://hto-c.usc.edu:8000/msms/suffix/.
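
To make the search step in the overview concrete, the following is a minimal sketch of matching a sequence index against a spectrum graph. It is not the authors' implementation, and it rests on several assumptions: a plain, uncompressed suffix trie stands in for a true suffix tree; the spectrum is reduced to a list of prefix masses, with every prefix peak assumed to be present; only a small subset of residue masses is included; and there are no modifications, no matching threshold, and no scoring. All function names, the tolerance TOL and the example proteins are hypothetical.

```python
# Illustrative sketch only (assumptions: suffix trie instead of suffix tree,
# complete prefix-mass ladder, no modifications, no statistical threshold).

RESIDUE_MASS = {  # monoisotopic residue masses in Da, small subset
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259,
}
TOL = 0.01  # hypothetical mass tolerance in Da


def build_suffix_trie(proteins, max_len=30):
    """Index every suffix (truncated to max_len) as nested dicts."""
    root = {}
    for seq in proteins:
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_len]:
                node = node.setdefault(ch, {})
    return root


def build_spectrum_graph(masses):
    """Nodes are sorted masses; an edge i -> j is labelled with a residue
    when the mass difference matches that residue within TOL."""
    nodes = sorted(masses)
    edges = {i: {} for i in range(len(nodes))}
    for i, low in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j] - low
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= TOL:
                    edges[i].setdefault(aa, []).append(j)
    return nodes, edges


def search(trie, edges, source, sink):
    """Walk the trie and the graph in lockstep; report every residue string
    that leads from the source node to the sink node."""
    hits = set()
    stack = [(trie, source, "")]
    while stack:
        tnode, gnode, path = stack.pop()
        if gnode == sink and path:
            hits.add(path)
            continue
        for aa, child in tnode.items():
            for nxt in edges[gnode].get(aa, ()):
                stack.append((child, nxt, path + aa))
    return hits


def prefix_masses(peptide):
    """Cumulative residue masses: 0, m(a1), m(a1)+m(a2), ..."""
    out, total = [0.0], 0.0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out


proteins = ["GASPVTLKE", "KEGAVL"]           # hypothetical database
trie = build_suffix_trie(proteins)
peaks = prefix_masses("SPV") + [150.0]       # ladder for SPV plus one noise peak
nodes, edges = build_spectrum_graph(peaks)
hits = search(trie, edges, source=0, sink=len(nodes) - 1)
print(sorted(hits))  # expected: ['SPV']
```

Because every substring of a protein is a prefix of one of its suffixes, the walk can start at the trie root without knowing where the enzyme cut, which is what makes this kind of index attractive for non-specific digestion. A real search would also have to tolerate missing peaks and modified residues.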

Speeding Up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast Near Neighbor Search

Mining Mass Spectra: Metric Embeddings and Fast Near Neighbor Search

A Novel Spectral Library Workflow to Enhance Protein Identifications

Abstract P326: an Innovative Peptide Spectral Library Search Engine for Cardiovascular Proteomics

Algorithmic study on mass spectrometry and proteomics

A Suffix Tree Approach to the Interpretation of Tandem Mass Spectra: Applications to Peptides of Non-Specific Digestion and Post-Translational Modifications.

Accelerating open modification spectral library searching on tensor core in high-dimensional space

Towards Less Biased Data-driven Scoring with Deep Learning-Based End-to-end Database Search in Tandem Mass Spectrometry

Faster graphical model identification of tandem mass spectra using peptide word lattices

Two-step Spectral Library Pre-Search: A Novel Approach for Speeding Up Compound Identification

A Suboptimal Algorithm for De Novo Peptide Sequencing Via Tandem Mass Spectrometry

A learned score function improves the power of mass spectrometry database search

SimMS: A GPU-Accelerated Cosine Similarity implementation for Tandem Mass Spectrometry

A Novel Scoring Schema for Peptide Identification by Searching Protein Sequence Databases Using Tandem Mass Spectrometry Data

Fast mass spectrometry search and clustering of untargeted metabolomics data

Efficient discovery of abundant post-translational modifications and spectral pairs using peptide mass and retention time differences

Efficient Indexing of Peptides for Database Search Using Tide

Extended similarity methods for efficient data mining in imaging mass spectrometry

Massively Parallel Open Modification Spectral Library Searching with Hyperdimensional Computing

A Machine Learning Approach to Explore the Spectra Intensity Pattern of Peptides Using Tandem Mass Spectrometry Data

yHydra: Deep Learning enables an Ultra Fast Open Search by Jointly Embedding MS/MS Spectra and Peptides of Mass Spectrometry-based Proteomics