Understanding Representations by Exploring Galaxies in Chemical Space

Jan Weinreich, Konstantin Karandashev, Guido Falk von Rudorff
2023-09-17
Abstract: We present a Monte Carlo approach for studying chemical feature distributions of molecules without training a machine learning model or performing exhaustive enumeration. The algorithm generates molecules with predefined similarity to a given one for any representation. It serves as a diagnostic tool to understand which molecules are grouped in feature space and to identify shortcomings of representations and embeddings from unsupervised learning. In this work, we first study clusters surrounding chosen molecules and demonstrate that common representations do not yield a constant density of molecules in feature space, with possible implications for learning behavior. Next, we observe a connection between representations and properties: a linear correlation between the property value of a central molecule and the average radial slope of that property in chemical space. Molecules with extremal property values have the largest property derivative values in chemical space, which provides a route to improve the data efficiency of a representation by tailoring it towards a given property. Finally, we demonstrate applications for sampling molecules with specified metric-dependent distributions to generate molecules biased toward graph spaces of interest.
Chemical Physics
What problem does this paper attempt to address?
The main objective of this paper is to study how molecular representations behave in the Chemical Compound Space (CCS) and how they relate to molecular properties. Specifically, the paper attempts to address the following key issues:
1. **Understanding how molecular representations distribute molecules in chemical space**: The authors propose a Monte Carlo method that generates molecules with a predefined similarity to a given molecule, for any representation, without training machine learning models or performing exhaustive enumeration. The method serves as a diagnostic tool to show which molecules are grouped together in feature space and to identify shortcomings of molecular representations and embeddings.
2. **Exploring the connection between molecular representations and molecular properties**: The study shows that common representations do not yield a constant density of molecules in feature space, which may affect learning behavior. It also finds a linear correlation between the property value of a central molecule and the average radial slope of that property in chemical space. Molecules with extremal property values have the largest property derivatives, which suggests a pathway to improve data efficiency by tailoring a representation towards a given property.
3. **Improving machine learning models**: By sampling molecules with specified metric-dependent distributions, it is possible to generate molecules biased towards the graph spaces of interest. This helps to improve molecular representations for specific properties.
4. **Analyzing the topological structure of the chemical space**: The research focuses on the structure of the chemical space induced by molecular representations and on the association between this structure and molecular properties. Additionally, the paper explores how to adjust representations so that they better reflect actual physical phenomena, thereby enhancing the accuracy of machine learning models.
In summary, this paper aims to develop a new method to gain a deeper understanding of the behavior of molecular representation methods in the chemical space and further explore the relationship between these representation methods and molecular properties. The ultimate goal is to improve the performance of machine learning models based on these representations.
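To make the sampling idea concrete, the following is a minimal sketch of a Metropolis Monte Carlo walk that collects "molecules" lying within a chosen distance shell around a central one. It is not the authors' implementation. Everything here is an illustrative assumption: molecules are toy strings over a four-letter alphabet rather than chemical graphs, `represent` is a simple bag-of-atoms count vector, the move set is a single-position substitution, and the names `potential`, `propose`, and `sample_shell` as well as the flat-bottomed penalty are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "CNOF"  # toy "atom types"; an assumption, not real chemistry


def represent(mol):
    """Toy representation (assumption): count of each symbol in the string."""
    counts = Counter(mol)
    return [float(counts[a]) for a in ALPHABET]


def distance(x, y):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def potential(d, d_min, d_max, stiffness=10.0):
    """Zero inside the shell [d_min, d_max], quadratic penalty outside it."""
    if d < d_min:
        return stiffness * (d_min - d) ** 2
    if d > d_max:
        return stiffness * (d - d_max) ** 2
    return 0.0


def propose(mol, rng):
    """Symmetric trial move: overwrite one random position with a random symbol."""
    i = rng.randrange(len(mol))
    return mol[:i] + rng.choice(ALPHABET) + mol[i + 1:]


def sample_shell(center, d_min, d_max, n_steps=5000, beta=1.0, seed=0):
    """Metropolis walk that records every visited state lying inside the shell."""
    rng = random.Random(seed)
    x0 = represent(center)
    current = center
    u_cur = potential(distance(represent(current), x0), d_min, d_max)
    visited = []
    for _ in range(n_steps):
        cand = propose(current, rng)
        u_new = potential(distance(represent(cand), x0), d_min, d_max)
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if u_new <= u_cur or rng.random() < math.exp(-beta * (u_new - u_cur)):
            current, u_cur = cand, u_new
        d = distance(represent(current), x0)
        if d_min <= d <= d_max:
            visited.append(current)
    return visited


if __name__ == "__main__":
    samples = sample_shell("CCCCNNOO", d_min=1.0, d_max=2.5, n_steps=2000)
    print(len(set(samples)), "distinct toy molecules found in the shell")
```

Swapping `represent` for a different featurization while keeping the same shell would show which toy molecules that representation treats as neighbors of the central one, which is the diagnostic use described above. A realistic version would need chemically valid graph moves in place of string substitutions.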