Tetrahedral Molecular Pretraining for Enhanced Property Prediction

Yuancheng Sun, Kai Chen, Kang Liu, Qiwei Ye
Abstract: Self-supervised learning on 3D molecular structures has emerged as a promising direction in data-driven scientific research, addressing the significant challenge of limited annotated biochemical data. While the identification of semantic units for pretraining has been well-established in natural language processing and computer vision, determining optimal building blocks for characterizing 3D molecular architectures remains underexplored. We present Tetrahedral Molecular Pretraining (TMP), a novel approach that recognizes tetrahedrons as fundamental building blocks, leveraging their geometric simplicity and recurring presence across chemical functional groups. Through systematic perturbation and reconstruction of tetrahedral substructures, TMP implements a self-supervised learning strategy that recovers both their global arrangements and local patterns, learning rich molecular representations that encode multi-scale structural information. Extensive evaluations on 24 benchmark datasets demonstrate that TMP consistently outperforms existing methods in tasks ranging from biochemical property prediction to quantum property prediction. Notably, the tetrahedra-based modeling successfully scales from small molecules to complex protein-ligand systems, achieving new state-of-the-art results in binding affinity prediction. Our findings highlight how the identification of representative structural patterns can lead to more expressive and interpretable neural networks for scientific applications.
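As a rough illustration of the "perturbation and reconstruction" idea mentioned in the abstract, the sketch below perturbs a single four-atom group with Gaussian noise and returns that noise as a reconstruction target. The abstract does not specify how TMP selects tetrahedra or what perturbation it applies, so the function name `perturb_tetrahedron`, the nearest-neighbor selection rule, the Gaussian noise model, and the `sigma` parameter are all assumptions made for illustration and should not be read as the authors' method.

```python
import numpy as np

def perturb_tetrahedron(coords, center, sigma=0.1, rng=None):
    """Perturb one four-atom group; return noisy coords, the noise, and the group indices."""
    rng = np.random.default_rng() if rng is None else rng
    # Hypothetical selection rule: the center atom plus its three nearest neighbors.
    dists = np.linalg.norm(coords - coords[center], axis=1)
    members = np.argsort(dists)[:4]  # the center itself has distance 0
    # Assumed perturbation: isotropic Gaussian noise on the selected atoms only.
    noise = np.zeros_like(coords)
    noise[members] = rng.normal(scale=sigma, size=(4, 3))
    return coords + noise, noise, members

# Toy input: a methane-like geometry plus one distant atom (made-up coordinates).
coords = np.array([
    [0.00, 0.00, 0.00],
    [0.63, 0.63, 0.63],
    [0.63, -0.63, -0.63],
    [-0.63, 0.63, -0.63],
    [-0.63, -0.63, 0.63],
    [3.00, 0.00, 0.00],
])
noisy, noise, members = perturb_tetrahedron(
    coords, center=0, rng=np.random.default_rng(0)
)
```

In a denoising-style setup, a model would receive `noisy` and be trained to predict `noise` (or the original `coords`) for the atoms listed in `members`. Whether TMP uses this exact objective is not stated in the abstract.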