Abstract: Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering. However, training these models from scratch requires significant computational resources, limiting their accessibility. To address this, we integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks. We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks. Additionally, we present an exploration of generating plastic-degrading enzyme candidates using the model's embeddings and latent space manipulation techniques. While the results suggest that further refinement is needed, this approach provides a foundation for future work in enzyme design. This study aims to facilitate the use of PLMs in research fields like synthetic biology and environmental sustainability, even for those with limited computational resources.

What problem does this paper attempt to address?

The main problem this paper addresses is how to lower the barrier to using Protein Language Models (PLMs), so that they can be applied more widely in fields such as protein function prediction and protein design. Specifically, the paper focuses on the following aspects:

1. **Computational resource limitations**: Training PLMs usually requires substantial computational resources, which makes it difficult for many researchers to train these models from scratch. To address this, the paper integrates a pre-trained PLM into DeepChem, an open-source framework for computational biology and chemistry. Researchers can then use these models without large compute budgets.
2. **Technical barrier**: Many potential users (such as biologists and chemists) may not have a strong background in computer science or machine learning, so it is difficult for them to use or fine-tune these complex models directly. By integrating the PLM into DeepChem, the paper provides a more user-friendly platform that lets these users apply PLMs in their research more conveniently.
3. **Exploration of practical applications**: The paper not only evaluates the integrated model on standard protein prediction tasks, but also explores using the model to generate plastic-degrading enzyme candidates. Although the results indicate that further improvement is needed, the method provides a basis for future enzyme design.

### Main contributions

- **Performance evaluation**: The paper evaluates the pre-trained ProtBERT model in DeepChem and demonstrates its applicability to multiple protein-related benchmark tasks.
- **Practical application cases**: By exploring the generation of plastic-degrading enzymes, it shows the potential of PLMs for solving real-world problems.
- **Open-source implementation**: The paper releases its implementation code, giving the broader research community access to large-scale PLMs without large computational resources or specialist expertise.

Through these efforts, the paper aims to reduce the barriers to using large-scale PLMs, enabling more researchers to use these models in their own work.
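
As a small illustration of what using a pre-trained ProtBERT involves on the input side, the sketch below formats a raw amino-acid string the way the public `Rostlab/prot_bert` checkpoint expects: upper-case residues separated by spaces, with the rare residues U, Z, O and B mapped to X. The helper name `format_for_protbert` is hypothetical and is not part of DeepChem; the paper's integration may handle this step internally or differently.

```python
import re

# Rare or ambiguous residues that the Rostlab/prot_bert model card
# maps to the unknown residue "X" before tokenization.
RARE_RESIDUES = re.compile(r"[UZOB]")


def format_for_protbert(sequence: str) -> str:
    """Hypothetical helper: return an upper-case, space-separated sequence."""
    # Remove any whitespace or line breaks (e.g. from FASTA bodies).
    cleaned = "".join(sequence.split()).upper()
    # Replace rare residues with X.
    cleaned = RARE_RESIDUES.sub("X", cleaned)
    # ProtBERT's vocabulary is per residue, so residues are space-separated.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(format_for_protbert("mktayiaUq"))  # prints "M K T A Y I A X Q"
```

The formatted string can then be passed to a tokenizer for the same checkpoint, for example Hugging Face's `BertTokenizer` loaded from `Rostlab/prot_bert` with `do_lower_case=False`.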

Open-Source Protein Language Models for Function Prediction and Protein Design
