Abstract: Accurate protein identification from mass spectrometry (MS) data is fundamental to unraveling the complex roles of proteins in biological systems, and peptide sequencing is a pivotal step in this process. The two main paradigms for peptide sequencing are database search, which matches experimental spectra against peptide sequences from databases, and de novo sequencing, which infers peptide sequences directly from MS data without relying on a pre-constructed database. Although database search methods are highly accurate, they cannot identify novel, modified, or mutated peptides that are absent from the database. In contrast, de novo sequencing is adept at discovering novel peptides but often struggles with missing peaks, which lowers its precision. We introduce SearchNovo, a novel framework that integrates the strengths of database search and de novo sequencing to enhance peptide sequencing. SearchNovo employs an efficient search mechanism to retrieve the most similar peptide spectrum match (PSM) from a database for each query spectrum, followed by a fusion module that uses the reference peptide sequence to guide the generation of the target sequence. Furthermore, we observe that dissimilar (noisy) reference peptides negatively affect model performance. To mitigate this, we construct pseudo reference PSMs to minimize their impact. Comprehensive evaluations on multiple datasets reveal that SearchNovo significantly outperforms state-of-the-art models. Further analysis indicates that many retrieved spectra contain peaks that are missing from the query spectra, and that the retrieved reference peptides often share common fragments with the target peptides. Both factors are key to the success of SearchNovo.
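The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how SearchNovo measures spectrum similarity, so the example below assumes a common stand-in: cosine similarity over intensity vectors binned by m/z. The function names (`bin_spectrum`, `cosine`, `retrieve_reference`), the bin width, the peptide strings, and all peak values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins and L2-normalise.

    `peaks` is a list of (m/z, intensity) pairs. The bin width is an
    assumption of this sketch, not a value taken from the paper.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    norm = math.sqrt(sum(v * v for v in binned.values()))
    if norm == 0.0:
        return {}
    return {b: v / norm for b, v in binned.items()}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (dicts)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def retrieve_reference(query_peaks, database, bin_width=1.0):
    """Return (peptide, similarity) for the most similar database PSM.

    `database` is a list of (peptide, peaks) pairs. A linear scan is
    used for clarity; a real system would need an index.
    """
    query = bin_spectrum(query_peaks, bin_width)
    best_peptide, best_score = None, -1.0
    for peptide, peaks in database:
        score = cosine(query, bin_spectrum(peaks, bin_width))
        if score > best_score:
            best_peptide, best_score = peptide, score
    return best_peptide, best_score


# Toy PSM database: made-up peptides and peaks, not real measurements.
database = [
    ("PEPTIDEK", [(100.2, 1.0), (200.4, 0.5), (300.7, 0.8)]),
    ("SAMPLER", [(150.1, 1.0), (250.3, 0.9)]),
]
# Query spectrum that lacks the peak near m/z 200.
query = [(100.3, 0.9), (300.6, 0.7)]

reference_peptide, similarity = retrieve_reference(query, database)
print(reference_peptide, round(similarity, 3))
```

In this toy case the retrieved spectrum carries a peak near m/z 200 that the query lacks, which mirrors the abstract's observation that retrieved spectra can supply peaks missing from the query. How the reference peptide is then fused into sequence generation is not described in enough detail here to sketch.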

Bridging the Gap between Database Search and De Novo Peptide Sequencing with SearchNovo