3DMolNet: A Generative Network for Molecular Structures

Vitali Nesterov, Mario Wieser, Volker Roth
DOI: https://doi.org/10.48550/arXiv.2010.06477
2020-10-08
Abstract: With the recent advances in machine learning for quantum chemistry, it is now possible to predict the chemical properties of compounds and to generate novel molecules. Existing generative models mostly use a string- or graph-based representation, but the precise three-dimensional coordinates of the atoms are usually not encoded. First attempts in this direction have been proposed, where autoregressive or GAN-based models generate atom coordinates. These either lack a latent space (in the autoregressive setting), such that a smooth exploration of the compound space is not possible, or cannot generalize to varying chemical compositions. We propose a new approach to efficiently generate molecular structures that are not restricted to a fixed size or composition. Our model is based on the variational autoencoder, which learns a translation-, rotation-, and permutation-invariant low-dimensional representation of molecules. Our experiments yield a mean reconstruction error below 0.05 Angstrom, which outperforms the current state-of-the-art methods by a factor of four and is even lower than the spatial quantization error of most chemical descriptors. The compositional and structural validity of newly generated molecules has been confirmed by quantum chemical methods in a set of experiments.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
This paper addresses several key limitations of existing generative models for molecular structures:

1. **No encoding of three-dimensional atom coordinates**: Most existing generative models use string- or graph-based representations, which usually do not encode the exact three-dimensional coordinates of the atoms. Such models cannot distinguish between molecules that share the same chemical composition but differ in geometric configuration (i.e., isomers), which limits exploration of the chemical compound space (CCS).
2. **Lack of a continuous latent space**: Autoregressive models can generate atomic coordinates, but they lack a continuous latent space, so the compound space cannot be explored smoothly. In addition, when such models generate complex structures, the probability of completing a molecule decreases as the number of sampling steps grows.
3. **Restriction to a fixed chemical composition**: GAN-based methods can generate Euclidean distance matrices (EDMs), but they are restricted to a fixed chemical composition in order to avoid permutation problems. This limits how well they generalize to molecules of different chemical compositions.

To address these problems, the authors propose 3DMolNet, a generative network based on the variational autoencoder (VAE) that efficiently generates 3D molecular structures of variable size and chemical composition. Its main contributions are:

1. **A translation-, rotation-, and permutation-invariant molecular representation**: The permutation problem is solved by using a canonical ordering of atoms and coordinate pairs.
2. **An extended variational autoencoder model**: The model generates 3D molecular structures from a continuous, low-dimensional latent space, which allows smooth exploration of the molecular domain.
3. **Better results than existing methods on the QM9 dataset**: The root-mean-square deviation (RMSD) of the reconstructed heavy-atom coordinates is below 0.05 Å, roughly four times lower than that of the current state-of-the-art methods. In addition, the compositional and structural validity of newly generated molecules is verified with quantum chemical methods.

With these improvements, 3DMolNet both raises the accuracy of generated molecular structures and provides a more flexible tool for exploring the chemical compound space.
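The invariance properties discussed above can be illustrated with a small, self-contained sketch. It assumes a plain pairwise distance matrix as the molecular description, which is one common way to obtain translation and rotation invariance; the summary does not state that this is the exact representation used in 3DMolNet. The coordinates and the helper names (`distance_matrix`, `translate`, `rotate_z`, `matrices_close`) are made up for illustration.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between atoms given as (x, y, z) tuples."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def translate(coords, shift):
    """Shift every atom by the same vector."""
    return [tuple(c + s for c, s in zip(p, shift)) for p in coords]

def rotate_z(coords, angle):
    """Rotate every atom about the z-axis by `angle` radians."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cos_a * x - sin_a * y, sin_a * x + cos_a * y, z) for x, y, z in coords]

def matrices_close(m1, m2, tol=1e-9):
    """Element-wise comparison of two equally sized matrices."""
    return all(
        math.isclose(a, b, abs_tol=tol)
        for row1, row2 in zip(m1, m2)
        for a, b in zip(row1, row2)
    )

# Hypothetical coordinates in Angstrom for three atoms (not taken from QM9).
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (1.8, 1.1, 0.3)]
edm = distance_matrix(coords)

# A rigid motion (translation followed by rotation) leaves all distances unchanged.
moved = rotate_z(translate(coords, (3.0, -2.0, 0.5)), math.pi / 3)
same_after_rigid_motion = matrices_close(edm, distance_matrix(moved))

# Reordering the atoms permutes rows and columns, so the matrix itself changes.
permuted = [coords[2], coords[0], coords[1]]
same_after_permutation = matrices_close(edm, distance_matrix(permuted))

print(same_after_rigid_motion, same_after_permutation)
```

In this sketch the distance matrix survives the rigid motion but not the reordering of atoms, which is why a distance-based description alone does not give permutation invariance and some fixed convention, such as the canonical atom ordering mentioned above, is still needed.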