Abstract: In recent years, natural language processing (NLP) models have demonstrated remarkable capabilities in various domains beyond traditional text generation. In this work, we introduce PeptideGPT, a protein language model tailored to generate protein sequences with distinct properties: hemolytic activity, solubility, and non-fouling characteristics. To facilitate a rigorous evaluation of these generated sequences, we established a comprehensive evaluation pipeline consisting of ideas from bioinformatics to retain valid proteins with ordered structures. First, we rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. Finally, we predict the structure using ESMFold and select the proteins with pLDDT values greater than 70 to ensure ordered structure. The properties of generated sequences are evaluated using task-specific classifiers: PeptideBERT and HAPPENN. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation. Our experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP-based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics. Codes, models, and data used in this study are freely available at: https://github.com/aayush-shah14/PeptideGPT
What problem does this paper attempt to address?
The paper aims to design peptide sequences with specific functional properties by combining a Generative Pre-trained Transformer (GPT) with bioinformatics-based supervision. Specifically, the researchers developed a model named PeptideGPT that generates protein sequences with one of four target properties: hemolytic activity, non-hemolytic behavior, non-fouling behavior, and solubility. Sequences with these properties have applications in fields such as drug discovery, targeted therapy, and environmental sustainability.
### Main problem analysis:
1. **Hemolytic activity**:
- **Definition**: Hemolytic proteins are substances that can cause red blood cells to rupture (hemolysis), releasing hemoglobin.
- **Application**: These proteins have important medical applications, especially in therapies that require targeted cell lysis.
2. **Non-hemolytic**:
- **Definition**: Non-hemolytic proteins are proteins that do not cause red blood cells to rupture.
- **Application**: Because they do not damage red blood cells, these proteins are candidates for therapeutic agents.
3. **Non-fouling**:
- **Definition**: Non-fouling proteins resist the non-specific adsorption or binding of other substances on their surfaces.
- **Application**: This property matters for medical implants and drug delivery systems, where preventing the attachment of proteins and cells reduces the risk of biofilm formation, thrombosis, inflammation, and infection.
4. **Solubility**:
- **Definition**: Soluble proteins dissolve in aqueous solution without aggregating. Solubility is crucial for structural biology and biochemical experiments because insoluble proteins may aggregate and lose their native conformation, activity, and function.
- **Application**: These proteins are very useful in drug development and biomolecular engineering.
### Solutions:
- **Model architecture**: PeptideGPT is fine-tuned from ProtGPT2, which uses the GPT2-large architecture with 36 layers and a model dimension of 1280. The model has 738 million parameters and uses a Byte-Pair Encoding (BPE) tokenizer.
- **Dataset preparation**: For each property-related task, the researchers collected a separate dataset of protein sequences: hemolytic, non-hemolytic, non-fouling, and soluble proteins.
- **Generation and evaluation**:
  - **Generation**: Protein sequences are generated by adjusting parameters such as repetition penalty, top-k sampling, and maximum sequence length.
  - **Filtering**: The generated sequences are checked with bioinformatics-based filters (perplexity ranking and a convex hull test) so that only valid proteins with ordered structures are retained.
  - **Structure prediction**: The three-dimensional structure of each generated sequence is predicted using ESMFold, and sequences with pLDDT values above 70 are kept as having ordered structure.
  - **Property classification**: Specialized classification models (such as HAPPENN and PeptideBERT) are used to evaluate whether the generated sequences have the required properties.
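
The three filtering stages described above (perplexity ranking, a convex hull test, and a pLDDT threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` class, the two-dimensional `features` descriptor, and the `keep_top` fraction are all hypothetical, and the sketch assumes token log-probabilities and pLDDT scores have already been computed by a language model and a structure predictor such as ESMFold. Only the pLDDT cutoff of 70 comes from the source.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str
    token_logprobs: list   # natural-log token probabilities from the language model
    features: tuple        # hypothetical 2-D descriptor of the sequence (assumption)
    plddt: float           # mean pLDDT from a structure predictor, on a 0-100 scale


def perplexity(token_logprobs):
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(point, hull):
    """True if the point lies inside or on a counter-clockwise convex polygon.

    Assumes the hull has at least three non-collinear vertices.
    """
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], point) >= 0 for i in range(n))


def filter_candidates(candidates, reference_features, keep_top=0.5, plddt_min=70.0):
    """Apply the three filters in order and return the surviving candidates."""
    # 1. Rank by perplexity (lowest first) and keep a fraction; keep_top is an assumption.
    ranked = sorted(candidates, key=lambda c: perplexity(c.token_logprobs))
    ranked = ranked[: max(1, int(len(ranked) * keep_top))]
    # 2. Keep candidates whose descriptor falls inside the hull of reference proteins.
    hull = convex_hull(reference_features)
    in_hull = [c for c in ranked if inside_hull(c.features, hull)]
    # 3. Keep candidates whose predicted structure has pLDDT above the threshold.
    return [c for c in in_hull if c.plddt > plddt_min]
```

The convex hull here is built in two dimensions purely to keep the sketch self-contained; a real pipeline would likely use a library routine over whatever descriptors the authors chose, which this summary does not specify.
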
### Results:
- **Accuracy**: PeptideGPT achieves accuracies of 76.26%, 72.46%, 78.84%, and 68.06% on the hemolytic, non-hemolytic, non-fouling, and solubility generation tasks, respectively.
- **Generation quality**: Approximately 24% of the generated sequences have pLDDT values higher than 70, indicating that they have stable and ordered structures.
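
One plausible reading of the accuracy figures is the fraction of generated sequences that the task-specific classifier assigns to the intended property. The sketch below assumes that reading; the function name, the label strings, and the example predictions are hypothetical and are not taken from the paper.

```python
def property_accuracy(predicted_labels, target_label):
    """Fraction of generated sequences whose predicted label equals the target label."""
    if not predicted_labels:
        raise ValueError("no predictions to score")
    hits = sum(1 for label in predicted_labels if label == target_label)
    return hits / len(predicted_labels)


# Hypothetical classifier outputs for ten sequences from a hemolytic generator.
predictions = ["hemolytic"] * 7 + ["non-hemolytic"] * 3
print(f"{property_accuracy(predictions, 'hemolytic'):.2%}")
```
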
### Conclusion:
PeptideGPT generates protein sequences with specific functional properties (hemolytic activity, non-hemolytic behavior, non-fouling behavior, and solubility) at accuracies between roughly 68% and 79%. These results support the effectiveness of the model and highlight the potential of natural language processing-based techniques in computational protein design and bioengineering.