InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions
Jiezhong Qiu, Junde Xu, Jie Hu, Hanqun Cao, Liya Hou, Zijun Gao, Xinyi Zhou, Anni Li, Xiujuan Li, Bin Cui, Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Aimin Pan, Jie Tang, Jieping Ye, Junyang Lin, Jin Tang, Xingxu Huang, Pheng Ann Heng, Guangyong Chen
DOI: https://doi.org/10.1101/2024.04.17.589642
2024-04-20
Abstract: Large language models are renowned for their efficacy in capturing intricate patterns, including co-evolutionary relationships and underlying protein languages. However, current methodologies often fall short in illustrating the emergence of genomic insertions, duplications, and insertions/deletions (indels), which account for approximately 14% of human pathogenic mutations. Given that structure dictates function, mutated proteins with similar structures are more likely to persist throughout biological evolution. Motivated by this, we leverage cross-modality alignment and instruction fine-tuning techniques inspired by large language models to align a generative protein language model with protein structure instructions. Specifically, we present a method for generating variable-length and diverse proteins to explore and simulate the complex evolution of life, thereby expanding the repertoire of options for protein engineering. Our proposed protein LM-based approach, InstructPLM, demonstrates significant performance enhancements both in silico and in vitro. On native protein backbones, it achieves a perplexity of 2.68 and a sequence recovery rate of 57.51%, surpassing ProteinMPNN by 39.2% and 25.1%, respectively. Furthermore, we validate the efficacy of our model by redesigning PETase and L-MDH. For PETase, all fifteen designed variable-length PETase variants exhibit depolymerization activity, with eleven surpassing the activity level of the wild type. For L-MDH, an enzyme lacking an experimentally determined structure, InstructPLM is able to design functional enzymes from an AF2-predicted structure. Code and model weights of InstructPLM are publicly available.
Bioinformatics
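
The two in silico metrics quoted in the abstract, perplexity and sequence recovery, can be illustrated with a minimal sketch. It uses the standard textbook definitions (perplexity as the exponential of the mean per-residue negative log-likelihood, recovery as the percentage of positions matching the native sequence); the function names and inputs are hypothetical and are not taken from the InstructPLM code base, and the authors' exact evaluation protocol may differ.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-residue natural-log probabilities.

    Assumption: `log_probs` holds ln p(residue_i | context) that a model
    assigned to each residue of the native sequence.
    """
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    mean_nll = -sum(log_probs) / len(log_probs)
    return math.exp(mean_nll)


def sequence_recovery(designed, native):
    """Percentage of positions where the designed residue equals the native one.

    Assumption: both sequences have the same length. Variable-length designs
    would need an alignment step first, which is not shown here.
    """
    if not native or len(designed) != len(native):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)
```

Lower perplexity and higher recovery both indicate that the model places more probability on the native sequence for a given backbone, which is the sense in which the abstract compares InstructPLM with ProteinMPNN.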