Hunting for peptide binders of specific targets with data-centric generative language models
Zhiwei Nie, Daixi Li, Jie Chen, Fan Xu, Yutian Liu, Jie Fu, Xudong Liu, Zhennan Wang, Yiming Ma, Kai Wang, Jingyi Zhang, Zhiheng Hu, Guoli Song, Yuxin Ye, Feng Yin, Bin Zhou, Zhihong Liu, Zigang Li, Wen Gao, Yonghong Tian
DOI: https://doi.org/10.1101/2023.12.31.573750
2024-01-01
Abstract: The increasing frequency of emerging viral infections calls for more efficient and lower-cost drug design methods. Peptide binders have emerged as strong contenders for curbing pandemics owing to their efficacy, safety, and specificity. Here, we propose a customizable, low-cost pipeline that incorporates a model auditing strategy and a data-centric methodology for controllable peptide generation. A generative protein language model, pretrained on approximately 140 million protein sequences, is directionally fine-tuned to generate peptides with desired properties and binding specificity. The subsequent multi-level structure screening narrows the synthetic distribution space of peptide candidates step by step to identify authentic high-quality samples, i.e., potential peptide binders, at the in silico stage. Paired with molecular dynamics simulations, this screening quickly reduces the number of candidates that need to be verified in wet-lab experiments from more than 2.2 million to 16. These potential binders are characterized by enhanced yeast display to determine their expression levels and binding affinity to the target. The results show that only a dozen candidates need to be characterized to obtain a peptide binder with ideal binding strength and binding specificity. Overall, this work achieves efficient and low-cost peptide design based on a generative language model, increasing the speed of protein design to an unprecedented level. The proposed pipeline is customizable: it is suitable for the rapid design of multiple protein families with only minor modifications.
Bioinformatics