Abstract: The function or dysfunction of a protein depends on its primary structure, and protein sequencing, which provides key information on protein-related biological processes and disease, plays an important role in biological, biomedical, and clinical research and applications. To obtain precise protein sequences, researchers have developed a range of methods over the past few decades, which can be divided into conventional and newly developed methods. The former include Edman degradation and mass spectrometry (MS); the latter include single-molecule detection, nanopores, and other recently developed techniques. In the 1960s, classic Edman degradation was first developed to sequence protein molecules from the N-terminus using a cyclic chemical reaction. Solid-phase and gas-phase Edman degradation were developed afterwards and still play a significant role in modern technologies. This review discusses the principle and limits of Edman degradation. We also discuss the advantages and shortcomings of MS-based approaches, which are the current standard methods for protein sequencing applications. Single-molecule approaches could revolutionize proteomics by providing the sensitivity needed for low-abundance protein detection and single-cell proteomics. With the development of single-molecule nucleic acid sequencing, the four bases of DNA/RNA can be detected effectively using label-free or fluorescence labelling strategies. However, labelling and analyzing all twenty kinds of amino acid residues remains a challenge. Sensitive optical detection has nevertheless been used for high-throughput protein sequencing with fluorescence labelling. In this approach, selected residues of the peptides are labelled and the C-terminus is anchored to a glass substrate. The N-terminal residues are then removed one at a time through Edman cycles. Finally, the sequence is analyzed from the wide-field fluorescence signals. 
This method has the potential for large-scale, sensitive, and parallel detection, and we discuss its principle and characteristic features in detail. Nanopores, including biological and solid-state nanopores, have emerged as powerful tools for protein sequencing. A nanopore provides a single-molecule sensing interface and a controlled nano-confined space, enabling ultimate sensitivity and high spatiotemporal resolution. Nanopore-based technologies rely on the interaction between functional groups and the nanopore, which modulates the ionic current. Information about the peptides can be obtained by monitoring the ionic current responses. Arrayed nanopores have the potential for high-throughput detection at low abundance. The technology is still at an early stage of development, and some challenges need to be addressed. As a "fingerprint" signal, the Raman spectrum is an ideal candidate for protein sequencing. However, very weak signals significantly restrict its application, especially at low concentrations of the target molecule. Surface-enhanced Raman spectroscopy (SERS) can enhance the Raman signal to achieve detection at the single-molecule scale. The combination of SERS and nanopores has demonstrated label-free detection of ten kinds of amino acids, offering a new strategy for protein sequencing. Compared with the weak Raman signal, fluorescence signals are more accessible, even at the single-molecule level. Several molecular dynamics (MD) simulations are discussed to show the possibility of sequencing fluorescence-labelled proteins within a nanopore. Nevertheless, some drawbacks need to be addressed, especially the high cost of nanopore fabrication and the translocation of proteins through a pore. 
Finally, this review discusses future challenges and summarizes recent efforts to break the bottlenecks of current protein sequencing, with the aim of promoting the development of medical treatment, disease diagnosis, and related fields.
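
The fluorescence-based readout summarized in the abstract (selected residues labelled, C-terminus anchored, one N-terminal residue removed per Edman cycle) can be illustrated with a minimal sketch. It assumes an idealized experiment: every cycle removes exactly one residue, every target residue carries exactly one dye, and no dye is lost to bleaching. The function names `label_drop_cycles` and `intensity_trace` and the example peptide are hypothetical and are not taken from the reviewed work.

```python
from typing import Dict, List, Set


def label_drop_cycles(peptide: str, labelled: Set[str]) -> Dict[str, List[int]]:
    """For each labelled residue type, list the Edman cycles (1-based)
    at which a dye leaves the anchored peptide.

    Idealized assumption: cycle k removes the k-th residue from the N-terminus.
    """
    drops: Dict[str, List[int]] = {aa: [] for aa in sorted(labelled)}
    for cycle, residue in enumerate(peptide, start=1):
        if residue in labelled:
            drops[residue].append(cycle)
    return drops


def intensity_trace(peptide: str, residue: str) -> List[int]:
    """Dye count in one fluorescence channel before any cycle and after each cycle."""
    remaining = peptide.count(residue)
    trace = [remaining]
    for aa in peptide:
        if aa == residue:
            remaining -= 1  # the labelled residue was cleaved in this cycle
        trace.append(remaining)
    return trace


if __name__ == "__main__":
    peptide = "GKAGCKAC"  # hypothetical peptide, written N-terminus to C-terminus
    print(label_drop_cycles(peptide, {"K", "C"}))
    print(intensity_trace(peptide, "K"))
```

The step-wise drops give only a partial sequence (the positions of the labelled residue types), which is why such a readout would typically be matched against a reference proteome rather than read as a full de novo sequence.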

A generalised protein identification method for novel and diverse sequencing technologies

ProteinInferencer: Confident protein identification and multiple experiment comparison for large scale proteomics projects

Machine learning-aided protein identification from multidimensional signatures

High Accuracy Protein Identification: Fusion of solid-state nanopore sensing and machine learning

A minimalist binary/digital approach to large-scale single molecule protein identification with optically labeled tRNAs and multiple carboxypeptidases and its extension to peptide sequencing

Protein identification with deep learning: from abc to xyz

Recent Advances in Protein Sequencing

Distinguishing Proteins From Arbitrary Amino Acid Sequences

Amplifiable protein identification via residue-resolved barcoding and composition code counting

Real-time dynamic single-molecule protein sequencing on an integrated semiconductor device

Protein Sequencing with an Adaptive Genetic Algorithm from Tandem Mass Spectrometry

A nested mixture model for protein identification using mass spectrometry

Whole protein sequencing and quantification without proteolysis, terminal residue cleavage, or purification: A computational model

A protein sequence fitness function for identifying natural and nonnatural proteins

Automated protein identification by tandem mass spectrometry: Issues and strategies

Binomial probability distribution model-based protein identification algorithm for tandem mass spectrometry utilizing peak intensity information.

Identifying Novel Protein Phenotype Annotations by Hybridizing Protein–protein Interactions and Protein Sequence Similarities

Mass spectrometry based protein identification with accurate statistical significance assignment

Peptide Sequencing Via Protein Language Models

Highly Robust de Novo Full-Length Protein Sequencing