A generalised protein identification method for novel and diverse sequencing technologies

Bikash Kumar Bhandari,Nick Goldman
DOI: https://doi.org/10.1101/2024.02.29.582769
2024-03-04
Abstract:Protein sequencing is a rapidly evolving field with much progress towards the realisation of a new generation of protein sequencers. The early devices, however, may not be able to reliably discriminate all 20 amino acids, resulting in a partial, noisy and possibly error-prone signature of a protein. Rather than achieving sequencing, these devices may aim to identify target proteins by comparing such signatures to databases of known proteins. However, there are no broadly applicable methods for this identification problem. Here, we devise a hidden Markov model method to study the generalized problem of protein identification from noisy signature data. Using a hypothetical sequencing device that can simulate several novel devices, we show that on the human protein database (N=20,181) our method has a good performance under many different operating conditions such as various levels of signal resolvability, different numbers of discriminated amino acids, sequence fragments and insertion and deletion error rates. Our results demonstrate the possibility of protein identification with high accuracy on many early experimental devices. We anticipate our method to be applicable for a wide range of protein sequencing devices in the future.
Bioinformatics
What problem does this paper attempt to address?