Hierarchical Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction

Minghao Guo, Veronika Thost, Samuel W Song, Adithya Balachandran, Payel Das, Jie Chen, Wojciech Matusik
2023-09-05
Abstract: The prediction of molecular properties is a crucial task in the field of material and drug discovery. The potential benefits of using deep learning techniques are reflected in the wealth of recent literature. Still, these techniques are faced with a common challenge in practice: Labeled data are limited by the cost of manual extraction from literature and laborious experimentation. In this work, we propose a data-efficient property predictor by utilizing a learnable hierarchical molecular grammar that can generate molecules from grammar production rules. Such a grammar induces an explicit geometry of the space of molecular graphs, which provides an informative prior on molecular structural similarity. The property prediction is performed using graph neural diffusion over the grammar-induced geometry. On both small and large datasets, our evaluation shows that this approach outperforms a wide spectrum of baselines, including supervised and pre-trained graph neural networks. We include a detailed ablation study and further analysis of our solution, showing its effectiveness in cases with extremely limited data. Code is available at https://github.com/gmh14/Geo-DEG.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses data efficiency in molecular property prediction, that is, making accurate predictions when labeled data is limited. It proposes a method based on a learnable hierarchical molecular grammar that improves prediction accuracy, especially on small datasets. The core contributions of the paper can be summarized as follows:

1. **A new framework**: The framework uses a learnable hierarchical molecular grammar to construct a geometry over the space of molecules and performs property prediction over that geometry. This is particularly useful when data is scarce, because the geometry encodes structural similarity between molecules and so serves as a prior that does not depend on having many labels.
2. **A hierarchical molecular grammar**: The grammar consists of two parts: a predefined meta-grammar, used to generate tree structures, and a learnable molecular grammar, used to convert those trees into specific molecules. This design retains the advantages of general molecular grammars, such as explicitness and interpretability, while being more compact and practical.
3. **Experimental results**: In evaluations on multiple benchmark datasets, the method significantly outperforms existing graph neural networks and pre-trained models on small datasets. Even with very few training samples, it achieves performance comparable to pre-trained models fine-tuned on the entire training set.

In summary, the paper tackles molecular property prediction with limited data by introducing a hierarchical molecular grammar and the geometry it induces over molecular graphs. The method is both a methodological contribution and a strong performer in the reported experiments.
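
To make the idea of predicting properties by diffusion over a geometry more concrete, here is a minimal sketch. It is not the authors' implementation: it swaps the paper's learned graph neural diffusion for plain, fixed heat diffusion on a small hand-written graph, and the function `diffuse_labels`, its parameters, and the toy path graph are all hypothetical names chosen for illustration. Each node stands in for a molecule, each edge for "structurally close under the grammar", and only two nodes carry known property values.

```python
import numpy as np


def diffuse_labels(adjacency, labels, labeled_mask, steps=200, dt=0.1):
    """Spread known property values over a graph by explicit-Euler heat diffusion.

    Hypothetical helper for illustration only; a stand-in for learned
    graph neural diffusion, with no trainable parameters.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels, dtype=float)
    labeled_mask = np.asarray(labeled_mask, dtype=bool)

    # Combinatorial graph Laplacian L = D - A.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

    # Unlabeled nodes start at the mean of the known labels.
    x = np.where(labeled_mask, labels, labels[labeled_mask].mean())

    for _ in range(steps):
        x = x - dt * (laplacian @ x)             # one diffusion step
        x[labeled_mask] = labels[labeled_mask]   # clamp the known values
    return x


# Toy geometry (an assumption): five "molecules" arranged in a path 0-1-2-3-4.
adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
labels = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # only nodes 0 and 4 are labeled
labeled_mask = np.array([True, False, False, False, True])

predictions = diffuse_labels(adjacency, labels, labeled_mask)
print(predictions)
```

On this path graph the clamped diffusion settles toward a linear interpolation between the two labeled ends, which shows how a fixed notion of structural neighborhood can carry information from a handful of labels to unlabeled molecules. In the paper the geometry comes from the grammar and the diffusion is learned, so this sketch only conveys the general mechanism.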