Unsupervised manifold embedding to encode molecular quantum information for supervised learning of chemical data

Tonglei Li, Nicholas J. Huls, Shan Lu, Peng Hou
DOI: https://doi.org/10.1038/s42004-024-01217-z
IF: 7.211
2024-06-12
Communications Chemistry
Abstract: Molecular representation is critical in chemical machine learning. It governs the complexity of model development and the fulfillment of training data to avoid either over- or under-fitting. As electronic structures and associated attributes are the root cause for molecular interactions and their manifested properties, we have sought to examine the local electron information on a molecular manifold to understand and predict molecular interactions. Our efforts led to the development of a lower-dimensional representation of a molecular manifold, Manifold Embedding of Molecular Surface (MEMS), to embody surface electronic quantities. By treating a molecular surface as a manifold and computing its embeddings, the embedded electronic attributes retain the chemical intuition of molecular interactions. MEMS can be further featurized as input for chemical learning. Our solubility prediction with MEMS demonstrated the feasibility of both shallow and deep learning by neural networks, suggesting that MEMS is expressive and robust against dimensionality reduction.
chemistry, multidisciplinary
What problem does this paper attempt to address?
The paper addresses two challenges in chemical machine learning: the Curse of Dimensionality (COD) and the empirical nature of conventional molecular descriptors. To overcome them, the authors propose a molecular representation called Manifold Embedding of Molecular Surface (MEMS). MEMS treats a molecular surface as a manifold and computes a lower-dimensional embedding of it, carrying the electron density and other local electronic properties on the surface into the embedding so that the embedded quantities retain the chemical intuition of molecular interactions and can be related directly to molecular properties. The paper demonstrates MEMS as input for supervised learning in solubility prediction, using both shallow and deep neural networks. The results suggest that MEMS is expressive and robust against dimensionality reduction. Featurizing MEMS with shape context matrices reduces the dimensionality further while retaining the electronic property information on the molecular surface. In summary, the study develops a new molecular representation intended to mitigate the Curse of Dimensionality in chemical machine learning and to improve the prediction of molecular properties.
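The two steps described above (embedding a surface into a lower dimension, then featurizing the embedded property values) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and is not the authors' implementation: the surface is a toy sphere rather than a real molecular surface, the property values are synthetic, the embedding is an Isomap-style method (nearest-neighbour geodesics plus classical MDS) swapped in for whatever embedding the paper uses, and the function names `fibonacci_sphere`, `isomap_2d`, and `shape_context` are hypothetical. The featurization is a simplified shape-context-style log-polar grid that stores the mean property value per bin.

```python
import numpy as np

def fibonacci_sphere(n):
    """Toy stand-in for a molecular surface: n roughly uniform points on a unit sphere."""
    i = np.arange(n) + 0.5
    phi = np.arccos(1 - 2 * i / n)
    theta = np.pi * (1 + 5 ** 0.5) * i
    return np.column_stack([np.cos(theta) * np.sin(phi),
                            np.sin(theta) * np.sin(phi),
                            np.cos(phi)])

def isomap_2d(points, k=8):
    """Isomap-style 2D embedding (an assumed substitute for the paper's method)."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Keep only edges to each point's k nearest neighbours, then symmetrise.
    graph = np.full((n, n), np.inf)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    graph[rows, idx.ravel()] = d[rows, idx.ravel()]
    graph = np.minimum(graph, graph.T)
    np.fill_diagonal(graph, 0.0)
    # Floyd-Warshall: shortest graph paths approximate geodesic distances.
    for m in range(n):
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    # Classical MDS on the geodesic distances (double centering + eigendecomposition).
    j = np.eye(n) - 1.0 / n
    b = -0.5 * j @ (graph ** 2) @ j
    w, v = np.linalg.eigh(b)
    top = np.argsort(w)[::-1][:2]
    return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))

def shape_context(emb, values, center, n_r=4, n_theta=8):
    """Mean property value in log-polar bins around one embedded point."""
    rel = np.delete(emb, center, axis=0) - emb[center]
    val = np.delete(values, center)
    r = np.linalg.norm(rel, axis=1)
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(max(r.min(), 1e-12)), np.log10(r.max()), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r, side="right") - 1, 0, n_r - 1)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    total = np.zeros((n_r, n_theta))
    count = np.zeros((n_r, n_theta))
    np.add.at(total, (r_bin, t_bin), val)
    np.add.at(count, (r_bin, t_bin), 1)
    # Empty bins are left at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

points = fibonacci_sphere(150)        # toy surface, not a real molecule
values = points[:, 2]                 # synthetic "electronic property" per surface point
emb = isomap_2d(points)               # 150 x 2 embedding
features = shape_context(emb, values, center=0)  # 4 x 8 feature matrix
```

In this sketch the 150 surface points with one property each are reduced to a fixed 4 x 8 matrix, which is the kind of compact, fixed-size input a neural network can consume. The bin counts and neighbour count are arbitrary choices.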