Abstract: Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing using bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models' performance using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of peptides with highly diverse N- and C-termini generalized 76% better than the termini-restricted ones. Moreover, expanding the training set's size by adding peptides from the multienzymatic digestion of samples from several species with five proteases led to a 2–3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10,000 matching and overlapping peptides across mAb samples digested with six different proteases, achieving 100% sequence coverage for eight of the ten polypeptide chains. We anticipate that the MEMs' proven improvements to de novo analysis will positively impact several applications, such as the analysis of samples of high complexity or unknown nature and the field of peptidomics.
In recent years, the application of deep learning has represented a breakthrough in the mass spectrometry (MS) field by improving the assignment of the correct amino acid sequence to observed MS spectra without prior knowledge, a task known as de novo MS-based peptide sequencing. However, like other modern neural networks, these models do not generalize well enough: they perform poorly on test sets of peptides with highly varied N- and C-termini. To mitigate this generalizability problem, we conducted a systematic investigation to uncover the requirements for building generalized models and boosting performance on the MS-based de novo peptide sequencing task. Several experiments confirmed that the training set's peptide diversity directly impacts the resulting model's generalizability. The data showed that the best models were the multienzyme models (MEMs), i.e., models trained on a compendium of highly diverse peptides, such as the one generated by digesting samples from a broad range of species with a group of proteases. The applicability of these MEMs was then established by fully de novo sequencing eight of the ten polypeptide chains of five commercial antibodies and extracting over 10,000 supporting peptides.
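The peptide recall used to compare models can be illustrated with a minimal sketch. It assumes the commonly used definition (the fraction of spectra whose predicted sequence matches the reference peptide exactly) and assumes that the isobaric residues I and L are treated as equivalent; neither detail is stated in the text above, and the function name `peptide_recall` and the example sequences are hypothetical.

```python
from typing import Optional, Sequence


def peptide_recall(
    predicted: Sequence[Optional[str]],
    reference: Sequence[str],
    equate_il: bool = True,
) -> float:
    """Fraction of spectra whose predicted peptide equals the reference.

    Assumption: a spectrum with no prediction (None) counts as a miss.
    Assumption: with equate_il=True, I and L are treated as the same
    residue because they have identical masses.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0

    def normalize(seq: str) -> str:
        return seq.replace("I", "L") if equate_il else seq

    hits = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred is not None and normalize(pred) == normalize(ref)
    )
    return hits / len(reference)


# Hypothetical predictions for four spectra (not data from the study).
predicted = ["PEPTIDEK", "LSVAGR", None, "AAAK"]
reference = ["PEPTIDEK", "ISVAGR", "DLGEEHFK", "AAAR"]

recall_il = peptide_recall(predicted, reference)
recall_strict = peptide_recall(predicted, reference, equate_il=False)
print(f"recall (I=L): {recall_il:.2f}, recall (strict): {recall_strict:.2f}")
```

Under this definition, a single wrong residue makes the whole peptide a miss, which is why the metric is sensitive to how well a model handles unfamiliar peptide termini.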

Decoding the Impact of Neighboring Amino Acid on ESI-MS Intensity Output through Deep Learning

AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information

Predicting Peptide Ionization Efficiencies for Electrospray Ionization Mass Spectrometry Using Machine Learning

A Machine Learning Approach to Explore the Spectra Intensity Pattern of Peptides Using Tandem Mass Spectrometry Data

Deep learning the collisional cross sections of the peptide universe from a million experimental values

DeepIso: A Deep Learning Model for Peptide Feature Detection

Deep Learning Powers Protein Identification from Precursor MS Information

Test-Time Training for Deep MS/MS Spectrum Prediction Improves Peptide Identification

Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics

AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics

Response of Peptide Intensity to Concentration in ESI-MS-based Proteome

Deep Learning Predicts Non-Normal Peptide FAIMS Mobility Distributions Directly from Sequence

Investigation of Noncovalent Interactions Between Peptides with Potential Intrinsic Sequence Patterns by Mass Spectrometry

Preliminary ESI-MS and MALDI-TOF Analysis on Phosphorylated Tetrapeptides with Xaa-Pro Motif

A Novel Scoring Schema for Peptide Identification by Searching Protein Sequence Databases Using Tandem Mass Spectrometry Data

DPST: De Novo Peptide Sequencing with Amino-Acid-Aware Transformers

iAmideV-Deep: Valine Amidation Site Prediction in Proteins Using Deep Learning and Pseudo Amino Acid Compositions

DMSS: an Attention-Based Deep Learning Model for High-Quality Mass Spectrometry Prediction

Independent highly sensitive characterization of asparagine deamidation and aspartic acid isomerization by sheathless CZE-ESI-MS/MS

Ion Mobility Coupled to a Time-of-Flight Mass Analyzer Combined With Fragment Intensity Predictions Improves Identification of Classical Bioactive Peptides and Small Open Reading Frame-Encoded Peptides