Abstract:

MOTIVATION: Tandem mass spectrometry combined with sequence database searching is one of the most powerful tools for protein identification. Because a mass spectrometer generates thousands of spectra in one hour, the speed of database searching is critical, especially when searching against a large sequence database, when the peptide is generated by an unknown or non-specific enzyme, or when the target peptides carry post-translational modifications (PTMs). In practice, about 70-90% of the spectra have no match in the database. Many believe that a significant portion of these unmatched spectra come from peptides produced by non-specific digestion by unknown enzymes, or from amino acid modifications. In other cases, scientists deliberately choose non-specific enzymes such as pepsin or thermolysin for proteolysis in proteomic studies, because not all proteins are amenable to digestion by site-specific enzymes, and many of the digested peptides may not fall within the range of molecular weights suitable for mass spectrometry analysis. Interpreting mass spectra of these kinds costs database search engines a great deal of computational time.

OVERVIEW: The present study was designed to speed up the database searching process in both cases. Specifically, we employ an approach that combines the suffix tree data structure with the spectrum graph. The suffix tree is used to preprocess the protein sequence database, while the spectrum graph is used to preprocess the tandem mass spectrum. We then search the suffix tree against the spectrum graph for candidate peptides. We design an efficient algorithm that computes, for each spectrum, a matching threshold at a chosen statistical significance level (e.g. p = 0.01), and we use this threshold to select candidate peptides. We then rank these peptides using a SEQUEST-like scoring function. The algorithms were implemented and tested on experimental data. For post-translational modifications, we allow an arbitrary number of any modification to a protein.

AVAILABILITY: The executable program and other supplementary materials are available online at: http://hto-c.usc.edu:8000/msms/suffix/.
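
The idea of searching a preprocessed sequence database against a preprocessed spectrum can be illustrated with a minimal Python sketch. It is not the authors' implementation: a depth-limited suffix trie built from nested dictionaries stands in for the suffix tree, a plain list of observed prefix (b-ion-like) residue masses stands in for the spectrum graph, and the statistical threshold and SEQUEST-like scoring are omitted. The names `build_suffix_trie` and `search_candidates`, the tolerance, and the toy protein sequences are all hypothetical. Because every substring of every protein is a path in the trie, no enzyme cleavage rule is needed, which is the situation for non-specific digestion.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_suffix_trie(proteins: List[str], max_len: int) -> dict:
    """Insert every suffix of every protein, truncated to max_len residues.

    Assumption: a plain trie is used for brevity; a real suffix tree
    would compress unary paths and need far less memory.
    """
    root: dict = {}
    for seq in proteins:
        for start in range(len(seq)):
            node = root
            for aa in seq[start:start + max_len]:
                node = node.setdefault(aa, {})
    return root


def search_candidates(trie: dict, prefix_peaks: List[float],
                      total_mass: float, tol: float = 0.02
                      ) -> List[Tuple[str, int]]:
    """Return (peptide, matched_prefix_count) pairs, best match first.

    A peptide is a candidate when the sum of its residue masses is within
    tol of total_mass. matched_prefix_count is the number of its proper
    prefixes whose mass lies within tol of an observed peak.
    """
    peaks = sorted(prefix_peaks)
    hits: List[Tuple[str, int]] = []

    def has_peak(mass: float) -> bool:
        i = bisect_left(peaks, mass - tol)
        return i < len(peaks) and peaks[i] <= mass + tol

    def walk(node: dict, path: str, mass: float, matched: int) -> None:
        for aa, child in node.items():
            m = mass + RESIDUE_MASS[aa]
            if m > total_mass + tol:
                continue  # prune: every extension would be too heavy
            if abs(m - total_mass) <= tol:
                hits.append((path + aa, matched))
                continue  # a longer path would overshoot the total mass
            walk(child, path + aa, m, matched + int(has_peak(m)))

    walk(trie, "", 0.0, 0)
    hits.sort(key=lambda h: -h[1])
    return hits


# Toy data (hypothetical): two short "proteins" and four observed prefix masses.
proteins = ["MKTAYIAKQR", "GASPVTAYIK"]
trie = build_suffix_trie(proteins, max_len=8)
observed = [101.048, 172.085, 335.148, 519.269]
candidates = search_candidates(trie, observed, total_mass=647.364)
print(candidates)
```

In this toy example two substrings share the same total residue mass, so the total mass alone cannot separate them; the count of matched prefix masses does. A fuller treatment would replace the raw count with a significance threshold and a proper scoring function, as the abstract describes.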

Algorithmic study on mass spectrometry and proteomics

AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information

A Novel Spectral Library Workflow to Enhance Protein Identifications

ProteinInferencer: Confident protein identification and multiple experiment comparison for large scale proteomics projects

A Suffix Tree Approach to the Interpretation of Tandem Mass Spectra: Applications to Peptides of Non-Specific Digestion and Post-Translational Modifications.

Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis

De Novo Sequencing of Peptides from Tandem Mass Spectra and Applications in Proteogenomics

A Suboptimal Algorithm for De Novo Peptide Sequencing Via Tandem Mass Spectrometry

Binomial probability distribution model-based protein identification algorithm for tandem mass spectrometry utilizing peak intensity information.

Efficient discovery of abundant post-translational modifications and spectral pairs using peptide mass and retention time differences

AIomics: exploring more of the proteome using mass spectral libraries extended by AI

MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry.

A Novel Scoring Schema for Peptide Identification by Searching Protein Sequence Databases Using Tandem Mass Spectrometry Data

Speeding Up Tandem Mass Spectrometry Database Search: Metric Embeddings and Fast Near Neighbor Search

Algorithms for Identifying Protein Cross-Links Via Tandem Mass Spectrometry

A Dynamic Programming Approach to De Novo Peptide Sequencing Via Tandem Mass Spectrometry

Mass spectrometry‐based high‐throughput proteomics and its role in biomedical studies and systems biology

Enhancing TOF/TOF-based de novo sequencing capability for high throughput protein identification with amino acid-coded mass tagging.

Large-scale protein identification using mass spectrometry

Algorithms for de-novo sequencing of peptides by tandem mass spectrometry: A review

Mass spectrometry-intensive top-down proteomics: an update on technology advancements and biomedical applications