FoldToken2: Learning compact, invariant and generative protein structure language

Zhangyang Gao, Cheng Tan, Stan Z Li
DOI: https://doi.org/10.1101/2024.06.11.598584
2024-06-13
Abstract: The equivariant nature of 3D coordinates has posed long-standing challenges for protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2, which converts equivariant structures into discrete tokens while keeping the original structures recoverable. From FoldToken1 to FoldToken2, we improve three key components: (1) the invariant structure encoder, (2) the vector-quantized compressor, and (3) the equivariant structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms its predecessor, FoldToken1, by 20% in TM-score and 81% in RMSD. FoldToken2 is likely the first method that works well for both single-chain and multi-chain protein structure quantization. We believe that FoldToken2 will inspire further improvements in protein structure representation, alignment, and generation tasks. An online example is available on Colab: https://colab.research.google.com/drive/1_z7qy4Vpomy7kzn1oxbjVHUTik47HSSX?usp=sharing
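To illustrate the general idea behind a vector-quantized compressor, the following is a minimal sketch of nearest-neighbor codebook lookup, the generic technique of mapping continuous embeddings to discrete token IDs. It is not the FoldToken2 implementation: the function name `quantize`, the codebook size (256), the embedding dimension (32), and the random data are all assumptions chosen for illustration only.

```python
import numpy as np


def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector.

    embeddings: array of shape (N, D), e.g. one vector per residue (assumed).
    codebook:   array of shape (K, D), the K discrete "vocabulary" vectors.
    Returns (tokens, quantized): token IDs of shape (N,) and the matching
    codebook vectors of shape (N, D).
    """
    # Squared Euclidean distance between every embedding and every code: (N, K)
    diff = embeddings[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # The discrete token is the index of the closest code
    tokens = dist2.argmin(axis=1)
    return tokens, codebook[tokens]


# Hypothetical sizes and random data, for illustration only
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 32))
embeddings = rng.normal(size=(10, 32))

tokens, quantized = quantize(embeddings, codebook)
```

In this sketch, `tokens` plays the role of the discrete "language", and a decoder would take `quantized` (or the token IDs) as input to reconstruct the structure. Training details such as how the codebook is learned are not covered here.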
Molecular Biology