Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction

Christopher Fifty, Joseph M. Paggi, Ehsan Amid, Jure Leskovec, Ron Dror
2023-10-07
Abstract: Few-shot learning is a promising approach to molecular property prediction as supervised data is often very limited. However, many important molecular properties depend on complex molecular characteristics -- such as the various 3D geometries a molecule may adopt or the types of chemical interactions it can form -- that are not explicitly encoded in the feature space and must be approximated from low amounts of data. Learning these characteristics can be difficult, especially for few-shot learning algorithms that are designed for fast adaptation to new tasks. In this work, we develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction. Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations, and a multi-task learning paradigm to structure the embedding space. On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance. Our code is available at https://github.com/cfifty/IGNITE.
Machine Learning
What problem does this paper attempt to address?
The paper addresses few-shot learning for molecular property prediction: how to predict a property accurately when only a handful of labeled molecules are available. Molecular property prediction matters in biochemical applications such as drug discovery, but experimental measurements are scarce and expensive to obtain.

The paper proposes IGNITE (Implicit Geometry and Interaction Embeddings), a method that trains molecular embeddings on a large synthetic dataset. The synthetic data are the results of molecular docking calculations, and a Graph Neural Network (GNN) is trained in a multi-task learning framework to predict the docking-estimated binding energy between small molecules and protein targets. The resulting embeddings encode complex molecular characteristics, such as 3D geometry and chemical interactions, that are not explicit in the input features. The embeddings are then used as the starting feature space for few-shot learning algorithms, namely multi-task learning, Model-Agnostic Meta-Learning (MAML), and Prototypical Networks. Training from this embedding space substantially improves all three on multiple molecular property prediction benchmarks, particularly when the number of labeled examples is small.

The paper also analyzes whether IGNITE embeddings actually capture these characteristics. When distances in several molecular representation spaces are compared, distances in the IGNITE embedding space agree most closely with docking-based distances, which supports the claim that the embeddings encode information about 3D conformation and chemical interactions.
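To make the final step of this pipeline concrete, the following is a minimal sketch of how a Prototypical Network classifies query molecules once each molecule has been mapped to a fixed-length embedding. It is an illustration under stated assumptions, not the authors' implementation: the embeddings are random toy vectors standing in for IGNITE embeddings, and the function names `class_prototypes` and `predict_proba`, the embedding dimension, and the support and query sizes are all hypothetical. The classification rule itself (class prototype as the mean support embedding, softmax over negative squared Euclidean distances) is the standard Prototypical Network rule.

```python
import numpy as np


def class_prototypes(support_emb, support_labels):
    """Return the sorted class ids and one mean embedding (prototype) per class."""
    classes = np.unique(support_labels)
    protos = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, protos


def predict_proba(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    # Pairwise squared distances, shape (n_query, n_classes).
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


# Toy stand-ins for pretrained molecular embeddings (assumption: in practice
# these would come from the pretrained GNN encoder, not a random generator).
rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding size
support_emb = np.vstack([
    rng.normal(-1.0, 0.3, size=(4, dim)),  # 4 labeled "inactive" molecules
    rng.normal(1.0, 0.3, size=(4, dim)),   # 4 labeled "active" molecules
])
support_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
query_emb = np.vstack([
    rng.normal(-1.0, 0.3, size=(3, dim)),  # queries drawn near the inactive cluster
    rng.normal(1.0, 0.3, size=(3, dim)),   # queries drawn near the active cluster
])

classes, protos = class_prototypes(support_emb, support_labels)
probs = predict_proba(query_emb, protos)
pred = classes[probs.argmax(axis=1)]
print(pred)
```

The sketch only shows why the quality of the embedding space matters: with eight labeled molecules, the classifier has no capacity to learn geometry or interaction patterns on its own, so its accuracy depends on how well the embedding already separates molecules with different properties.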