Masked Molecule Modeling: A New Paradigm of Molecular Representation Learning for Chemistry Understanding
Jiyan He, Kelei Tian, Shengjie Luo, Yaosen Min, Shuxin Zheng, Yu Shi, Dawei He, Haiguang Liu, Nenghai Yu, Liwei Wang, Ji Wu, Tie-Yan Liu
DOI: https://doi.org/10.21203/rs.3.rs-1746019/v1
2022-01-01
Abstract: Molecular representation learning is essential to deep learning for chemistry: molecules are embedded into continuous real-valued vectors that serve as better representations in the large chemical space. Traditional molecular representation learning requires high-quality labels for molecules. However, the precise physicochemical or pharmacological properties of molecules are expensive to measure and collect. Therefore, self-supervised training of deep learning models on large-scale, cheaply available chemical data is becoming an increasingly popular choice in research and practice. The masked auto-encoder is one of the most common self-supervised pretext tasks used in molecular representation learning. However, previous masked auto-encoder based methods focus on learning the existence of compounds, which may not be sufficient for chemical understanding. In this paper, we present Masked Molecule Modeling (MMM), a new self-supervised paradigm of molecular representation learning: a simple yet powerful pretext task that exploits the contextualized chemical semantics in more than two million reactions. As a result, the molecular representations learned by MMM show promising performance and significantly outperform existing methods on a wide range of real-world tasks, including biocatalysed synthesis planning, retrosynthesis planning, binding affinity prediction, and molecular pharmacological property prediction for drug discovery.
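
The abstract does not specify how molecules are masked or encoded, so the following is only a minimal illustrative sketch of the general idea of hiding one molecule inside its reaction context. It assumes reactions are written as reaction SMILES ("reactants>agents>products" with "." separating molecules); the `[MASK]` token and the `mask_molecule` helper are hypothetical names introduced here, not part of the paper.

```python
from typing import List, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_molecule(reaction_smiles: str, index: int) -> Tuple[str, str]:
    """Hide the index-th molecule of a reaction SMILES string.

    Molecules are counted left to right across reactants, agents, and
    products. Returns (masked_reaction, hidden_molecule); a model would be
    trained to recover hidden_molecule from masked_reaction.

    Simplification: every "."-separated fragment is treated as one
    molecule, which is not true for multi-component species such as salts.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")

    # Split each section into molecules; an empty section has none.
    sections: List[List[str]] = [p.split(".") if p else [] for p in parts]

    # Flat list of (section, position) pairs, one per molecule.
    flat = [(s, m) for s, mols in enumerate(sections) for m in range(len(mols))]
    if not 0 <= index < len(flat):
        raise IndexError("molecule index out of range")

    s, m = flat[index]
    target = sections[s][m]
    sections[s][m] = MASK_TOKEN
    masked = ">".join(".".join(mols) for mols in sections)
    return masked, target


# Esterification: acetic acid + ethanol -> ethyl acetate + water
masked, target = mask_molecule("CC(=O)O.OCC>>CC(=O)OCC.O", 1)
print(masked, target)
```

In this sketch the remaining reactants and products act as the context from which the hidden molecule would have to be inferred, which is one way to read "contextualized chemical semantics"; the actual model, tokenization, and training objective would have to be taken from the full paper.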