Pretraining Molecules with Explicit Substructure Information

Yuting Ma, Shuo Yu, Yanming Shen
DOI: https://doi.org/10.1137/1.9781611978032.60
2024-01-01
Abstract: Generative self-supervised learning has recently become popular in molecular modeling because it can improve accuracy and generalization. However, existing generative self-supervised tasks often have simplified designs that do not effectively use substructure information. Substructure information is important for molecules because it provides local semantics and captures analogous semantic information at the graph level. For example, the hydroxyl group (-OH) is a substructure typically associated with hydrophilicity. To address this limitation, we propose a novel pretraining task that incorporates substructure information into generative self-supervised tasks. This integration involves creating a substructure-based vocabulary and fusing structural insights into the representation learning process. We evaluate our approach on 10 publicly available datasets covering diverse molecular property prediction tasks. Our results consistently show that incorporating substructure information is effective compared with both contrastive and generative self-supervised pretraining methods.
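As a rough illustration of what a substructure-based vocabulary could look like, the sketch below maps substructure labels to integer IDs and encodes toy molecules with them. It is not the authors' method: the abstract does not say how substructures are extracted, so the fragment labels are hand-written assumptions, and `build_vocab`, `encode`, `min_count`, and the `<unk>` token are hypothetical names introduced only for this example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for rare or unseen substructures


def build_vocab(fragmented_molecules, min_count=1):
    """Map each substructure label to an integer ID (ID 0 is reserved for UNK)."""
    counts = Counter(frag for mol in fragmented_molecules for frag in mol)
    vocab = {UNK: 0}
    # Sort by descending count, then alphabetically, so IDs are deterministic.
    for frag, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[frag] = len(vocab)
    return vocab


def encode(molecule, vocab):
    """Replace each substructure label with its ID, falling back to UNK."""
    return [vocab.get(frag, vocab[UNK]) for frag in molecule]


# Toy input: each molecule is a list of hand-labelled fragments.
# These labels are assumed for illustration, not computed from structures.
molecules = [
    ["CH3", "CH2", "OH"],  # ethanol
    ["CH3", "OH"],         # methanol
    ["CH3", "C=O", "OH"],  # acetic acid, coarsely labelled
]

vocab = build_vocab(molecules, min_count=2)
encoded = [encode(m, vocab) for m in molecules]
```

In this toy setup the shared -OH fragment receives the same ID in every molecule, which is the kind of local, reusable signal the abstract attributes to substructures. A real pipeline would derive fragments from molecular graphs with a cheminformatics toolkit instead of hand-written labels.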