MolCloze - A Unified Cloze-style Self-supervised Molecular Structure Learning Model for Chemical Property Prediction.

Yingheng Wang, Xin Chen, Yaosen Min, Ji Wu
DOI: https://doi.org/10.1109/bibm52615.2021.9669794
2021-01-01
Abstract: Machine learning approaches in drug discovery, computational biology, and cheminformatics must predict accurately on test samples that are distributionally different from the training samples. However, (i) labeled task-specific molecule data are often scarce, and (ii) models generalize poorly when test molecules are structurally different from those seen during training. To alleviate these problems, we propose a cloze-style self-supervised learning model (MolCloze) to obtain universal, informative representations for molecular property prediction tasks. With carefully designed self-supervised tasks that unify the generative and discriminative paradigms, MolCloze can learn rich structural and semantic information about molecules from enormous amounts of unlabelled molecular data. To capture such complex information, we design two novel strategies: Structural Fingerprint Tokenization (SFT) for better tokenization of molecule graphs, and Normalized Graph Raw Shortcut-connection (NGRS) for better latent representations through training a deeper model. We pre-train the MolCloze model on three tasks: Unordered Masked Language Modeling (UMLM), Replaced Masked Token Detection (RMTD), and Contrastive Energy-based Unmasked Token Clozing (CE-UTC). We then transfer the pre-trained model to a broad range of downstream molecular property prediction tasks with minor architectural modifications. Extensive experiments demonstrate the generalizability of MolCloze by predicting a broad range of chemical properties related to drug discovery. We also observe a significant performance boost on different downstream molecular property prediction datasets, with MolCloze outperforming both state-of-the-art baseline approaches and previous pre-training techniques developed for molecule data.
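As a rough illustration of what cloze-style corruption can look like, the sketch below masks some tokens (giving targets for a masked-modeling objective) and swaps others for a different vocabulary entry (giving binary labels for a replaced-token detection objective). It is a generic sketch and not the paper's implementation: the abstract does not state masking rates, how replacements are sampled, or what the SFT vocabulary contains, so the function name `cloze_corrupt`, the rates, and the placeholder token strings are all assumptions.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def cloze_corrupt(tokens, vocab, mask_rate=0.15, replace_rate=0.15, seed=0):
    """Corrupt a token sequence for two cloze-style objectives.

    Returns (corrupted, mlm_targets, replaced_labels):
      mlm_targets[i]     original token if position i was masked, else None
      replaced_labels[i] 1 if position i was swapped for another token, else 0
    The rates are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    mlm_targets = [None] * len(tokens)
    replaced_labels = [0] * len(tokens)
    for i, tok in enumerate(tokens):
        r = rng.random()
        if r < mask_rate:
            # Masked position: the model would be asked to recover `tok`.
            corrupted[i] = MASK
            mlm_targets[i] = tok
        elif r < mask_rate + replace_rate:
            # Replaced position: the model would be asked to flag it.
            candidates = [v for v in vocab if v != tok]
            if candidates:
                corrupted[i] = rng.choice(candidates)
                replaced_labels[i] = 1
    return corrupted, mlm_targets, replaced_labels


# Placeholder substructure tokens standing in for a real tokenizer's output.
vocab = ["C", "N", "O", "c1ccccc1", "C=O"]
tokens = ["C", "C", "O", "c1ccccc1", "N", "C=O"]
corrupted, mlm_targets, replaced_labels = cloze_corrupt(tokens, vocab, seed=1)
```

In a real pipeline the corrupted sequence would be fed to the encoder, with a reconstruction loss on the masked positions and a per-position binary loss on the replaced labels; how MolCloze combines these with its contrastive energy-based task is not described in the abstract.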