ProTokens: A Machine-Learned Language for Compact and Informative Encoding of Protein 3D Structures
Xiaohan Lin, Zhenyu Chen, Yanheng Li, Xingyu Lu, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Yi Qin Gao, Jun Zhang
DOI: https://doi.org/10.1101/2023.11.27.568722
2023-01-01
Abstract: Designing protein structures for specific functions is of great value to science, industry and therapeutics. Although backbones can be designed with arbitrary variety in coordinate space, the generated structures may not be stabilized by any combination of natural amino acids, which leads to a high failure risk for many design approaches. Aiming to sketch a compact space of designable protein structures, we present an unsupervised learning strategy that integrates the structure prediction and inverse folding tasks to encode protein structures into amino-acid-like discrete tokens and decode these tokens back to 3D coordinates. We show that tokenizing protein structures with proper perplexity leads to compact and informative representations (ProTokens), which can reconstruct 3D coordinates with high fidelity and reduce the trans-rotational equivariance of protein structures. Protein structures can therefore be efficiently compressed, stored, aligned and compared in the form of ProTokens. In addition, ProTokens enable protein structure design via various generative AI methods without concern for symmetries, and even support the simultaneous design of structure and sequence. Finally, as a modality transformer, ProTokens provide a domain-specific vocabulary, allowing large language models to perceive, process and explore the microscopic structures of biomolecules as effectively as learning a foreign language.
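As an illustration of what working with structures "in the form of ProTokens" could look like, the following minimal Python sketch treats a tokenized structure as a sequence of integer token IDs, one per residue. It is not the authors' code: the function names `token_perplexity` and `token_identity` are hypothetical, perplexity is assumed here to mean the exponential of the Shannon entropy of the empirical token distribution (the paper may define it differently), and the comparison assumes two already aligned, equal-length token sequences.

```python
import math
from collections import Counter


def token_perplexity(tokens):
    """Perplexity of the empirical token distribution.

    Assumption: perplexity = exp(Shannon entropy in nats) of token usage.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


def token_identity(a, b):
    """Fraction of positions at which two equal-length token sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (pre-aligned)")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical token IDs, not taken from any real structure.
    structure_a = [3, 7, 7, 1, 4, 4, 4, 2]
    structure_b = [3, 7, 5, 1, 4, 4, 0, 2]
    print(token_perplexity(structure_a))
    print(token_identity(structure_a, structure_b))
```

Because the tokens are discrete and, per the abstract, independent of global translation and rotation, comparisons of this kind need no structural superposition step; a real pipeline would first obtain the tokens from the trained encoder and might use sequence alignment rather than position-wise matching.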
### Competing Interest Statement
The authors have declared no competing interest.