Multi-Scale Protein Language Model for Unified Molecular Modeling

Kangjie Zheng, Siyu Long, Tianyu Lu, Junwei Yang, Xinyu Dai, Ming Zhang, Zaiqing Nie, Wei-Ying Ma, Hao Zhou
DOI: https://doi.org/10.1101/2024.03.04.583284
2024-05-16
Abstract: Protein language models have demonstrated significant potential in the field of protein engineering. However, current protein language models primarily operate at the residue scale, which limits their ability to provide information at the atom level. This limitation prevents us from fully exploiting the capabilities of protein language models for applications involving both proteins and small molecules. In this paper, we propose ESM-AA (ESM All-Atom), a novel approach that enables atom-scale and residue-scale unified molecular modeling. ESM-AA achieves this by pre-training on multi-scale code-switch protein sequences and utilizing a multi-scale position encoding to capture relationships among residues and atoms. Experimental results indicate that ESM-AA surpasses previous methods in protein-molecule tasks, demonstrating the full utilization of protein language models. Further investigations reveal that through unified molecular modeling, ESM-AA not only gains molecular knowledge but also retains its understanding of proteins.
Bioinformatics
What problem does this paper attempt to address?
This paper proposes ESM-AA (ESM All-Atom), a multi-scale protein language model for unified molecular modeling. Current protein language models mainly operate at the amino acid (residue) scale and cannot provide atom-level information, which limits their usefulness in applications that involve both proteins and small molecules. ESM-AA addresses this in two ways: it is pre-trained on multi-scale code-switch protein sequences, and it uses a multi-scale position encoding to capture relationships among residues and atoms. Together these enable unified modeling at both the atom and residue scales. Experimental results show that ESM-AA surpasses previous methods on protein-molecule tasks. Further investigation indicates that, through unified molecular modeling, ESM-AA gains molecular knowledge while retaining its understanding of proteins. The work is relevant to protein engineering, drug discovery, enzyme engineering, and related fields.
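The abstract does not spell out how a code-switch sequence or its position encoding is built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "code-switching" means replacing a random subset of residue tokens with tokens for their constituent atoms, and that a residue-scale position index is shared by all atoms of the same residue. The function name `code_switch`, the `unzip_ratio` parameter, the `atom:` token prefix, and the small `RESIDUE_ATOMS` table are all hypothetical.

```python
import random

# Hypothetical, simplified table: heavy-atom element symbols for three residues
# (backbone N, CA, C, O, followed by side-chain atoms). Illustration only.
RESIDUE_ATOMS = {
    "G": ["N", "C", "C", "O"],            # glycine
    "A": ["N", "C", "C", "O", "C"],       # alanine
    "S": ["N", "C", "C", "O", "C", "O"],  # serine
}


def code_switch(sequence, unzip_ratio=0.3, seed=0):
    """Build a mixed residue/atom token sequence (assumed scheme, not the paper's code).

    Returns three parallel lists:
      tokens      - residue letters or "atom:<element>" tokens
      residue_pos - residue-scale position index for every token
      is_atom     - True where the token is an atom token
    """
    rng = random.Random(seed)
    tokens, residue_pos, is_atom = [], [], []
    for idx, res in enumerate(sequence):
        # Only residues present in the table can be expanded into atoms.
        if res in RESIDUE_ATOMS and rng.random() < unzip_ratio:
            atoms = RESIDUE_ATOMS[res]
            tokens.extend(f"atom:{a}" for a in atoms)
            # Assumption: atoms inherit the position of their parent residue.
            residue_pos.extend([idx] * len(atoms))
            is_atom.extend([True] * len(atoms))
        else:
            tokens.append(res)
            residue_pos.append(idx)
            is_atom.append(False)
    return tokens, residue_pos, is_atom


if __name__ == "__main__":
    toks, pos, mask = code_switch("GASG", unzip_ratio=0.5, seed=1)
    for t, p, m in zip(toks, pos, mask):
        print(f"{p}\t{'atom' if m else 'residue'}\t{t}")
```

With `unzip_ratio=1.0`, the input `"GA"` expands to nine atom tokens whose residue-scale positions are `[0, 0, 0, 0, 1, 1, 1, 1, 1]`; with `unzip_ratio=0.0` the sequence is returned unchanged as residue tokens. An atom-scale encoding, for example one based on pairwise distances between atom coordinates, would be a separate component and is not sketched here.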