A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining

Shengchao Liu, Weitao Du, Zhiming Ma, Hongyu Guo, Jian Tang
2023-05-28
Abstract: Molecule pretraining has quickly become the go-to schema to boost the performance of AI-based drug discovery. Naturally, molecules can be represented as 2D topological graphs or 3D geometric point clouds. Although most existing pretraining methods focus on a single modality, recent research has shown that maximizing the mutual information (MI) between these two modalities enhances the molecule representation ability. However, existing molecule multi-modal pretraining approaches approximate MI based on the representation space encoded from the topology and geometry, thus resulting in the loss of critical structural information of molecules. To address this issue, we propose MoleculeSDE. MoleculeSDE leverages group symmetric (e.g., SE(3)-equivariant and reflection-antisymmetric) stochastic differential equation models to generate the 3D geometries from 2D topologies, and vice versa, directly in the input space. It not only obtains a tighter MI bound but also enables a richer set of downstream tasks than previous work. By comparing with 17 pretraining baselines, we empirically verify that MoleculeSDE can learn an expressive representation with state-of-the-art performance on 26 out of 32 downstream tasks.
Machine Learning, Artificial Intelligence, Biomolecules
What problem does this paper attempt to address?
The paper asks how to estimate the mutual information (MI) between a molecule's 2D topological graph and its 3D geometric conformation more accurately during multi-modal pretraining, without losing key structural information. Existing multi-modal pretraining methods approximate MI in the encoded representation space when converting from 2D topology to 3D geometry or vice versa, which loses key molecular structure information. To address this, the paper proposes MoleculeSDE, which uses group-symmetric (e.g., SE(3)-equivariant and reflection-antisymmetric) stochastic differential equation (SDE) models to generate 3D geometries from 2D topologies, and vice versa, directly in the input space. MoleculeSDE obtains a tighter MI bound and also performs well on downstream tasks. Specifically, it addresses the problem in three ways:
1. **More accurate estimation of mutual information**: MoleculeSDE uses the SDE framework to generate 3D geometric structures, which reduces the information lost in MI estimation. Compared with the previous variational representation reconstruction (VRR) method, the SDE approach yields a tighter MI bound.
2. **Maintaining key structural information**: Generation is carried out directly in the input space rather than in the representation space, so the key structural information of molecules is better preserved.
3. **Supporting rich downstream tasks**: Beyond learning representations for downstream prediction tasks, MoleculeSDE can generate 3D conformations from 2D topologies, which is especially useful for tasks where only 2D topological data is available.
In a comparison with 17 pretraining baselines, the experimental results show that MoleculeSDE achieves state-of-the-art performance on 26 out of 32 downstream tasks.
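To make the ideas of "generation in the input space" and "SE(3)-equivariance" more concrete, the following is a minimal NumPy sketch. It is not MoleculeSDE's implementation. It assumes a generic variance-exploding noise schedule (`sigma`), a standard denoising score matching objective (`dsm_loss`), and a hand-written stand-in for a learned score network (`toy_score`); all of these names and parameter values are hypothetical and chosen only for illustration. The sketch perturbs 3D atom coordinates directly and checks that the toy score rotates with its input and ignores translations. It does not model conditioning on the 2D topology or the reflection-antisymmetric property mentioned in the abstract.

```python
import numpy as np


def sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Noise scale at time t in [0, 1] (assumed geometric schedule)."""
    return sigma_min * (sigma_max / sigma_min) ** t


def perturb(x0, t, rng):
    """Sample x_t ~ N(x0, sigma(t)^2 I) directly on the 3D coordinates."""
    eps = rng.standard_normal(x0.shape)
    return x0 + sigma(t) * eps, eps


def toy_score(x, t):
    """Hypothetical stand-in for a learned score network.

    Each atom's output is a weighted sum of relative position vectors, with
    weights that depend only on interatomic distances. Translating the input
    leaves the output unchanged; rotating the input rotates the output.
    """
    diff = x[:, None, :] - x[None, :, :]                   # (N, N, 3), x_i - x_j
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)    # (N, N, 1)
    weight = np.exp(-dist) / sigma(t)                      # distance-only weights
    return -(weight * diff).sum(axis=1)                    # (N, 3)


def dsm_loss(x0, t, rng):
    """Denoising score matching loss with sigma(t)^2 weighting."""
    x_t, eps = perturb(x0, t, rng)
    target = -eps / sigma(t)       # score of the Gaussian perturbation kernel
    residual = sigma(t) * (toy_score(x_t, t) - target)
    return float(np.mean(np.sum(residual ** 2, axis=-1)))


def random_rotation(rng):
    """Random 3x3 rotation matrix (orthogonal, determinant +1) via QR."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:       # flip one axis to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((5, 3))       # toy "molecule" with 5 atoms
    rot = random_rotation(rng)
    shift = rng.standard_normal(3)

    s = toy_score(x0, 0.5)
    s_moved = toy_score(x0 @ rot.T + shift, 0.5)
    print("equivariant:", np.allclose(s_moved, s @ rot.T))
    print("toy DSM loss:", dsm_loss(x0, 0.5, rng))
```

Because the toy score depends on coordinates only through relative position vectors and distances, the equivariance check holds by construction. A trained model would need an architecture with the same property for the generated geometries to respect the symmetry.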