Context-enriched molecule representations improve few-shot drug discovery

Johannes Schimunek, Philipp Seidl, Lukas Friedrich, Daniel Kuhn, Friedrich Rippmann, Sepp Hochreiter, Günter Klambauer
2023-04-25
Abstract: A central task in computational drug discovery is to construct models from known active molecules to find further promising molecules for subsequent screening. However, typically only very few active molecules are known. Therefore, few-shot learning methods have the potential to improve the effectiveness of this critical phase of the drug discovery process. We introduce a new method for few-shot drug discovery. Its main idea is to enrich a molecule representation with knowledge about known context or reference molecules. Our novel concept for molecule representation enrichment is to associate molecules from both the support set and the query set with a large set of reference (context) molecules through a Modern Hopfield Network. Intuitively, this enrichment step is analogous to a human expert who would associate a given molecule with familiar molecules whose properties are known. The enrichment step reinforces and amplifies the covariance structure of the data, while simultaneously removing spurious correlations arising from the decoration of molecules. Our approach is compared with other few-shot methods for drug discovery on the FS-Mol benchmark dataset. On FS-Mol, our approach outperforms all compared methods and therefore sets a new state of the art for few-shot learning in drug discovery. An ablation study shows that the enrichment step of our method is the key to improving the predictive quality. In a domain shift experiment, we further demonstrate the robustness of our method. Code is available at https://github.com/ml-jku/MHNfs.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
### The Problem This Paper Attempts to Solve

This paper addresses data scarcity in drug discovery: how to improve predictive models when only a small number of active molecules are known. Specifically:

1. **Low-Data Problem in Drug Discovery**:
   - Building accurate predictive models typically requires a large amount of biological measurement data, but the data available in real projects is very limited. Standard deep learning methods need hundreds or thousands of data points to train high-accuracy models.
   - In drug design projects, large datasets are hard to obtain because in vitro experiments are expensive and time-consuming.
2. **Improving Few-Shot Learning Methods**:
   - Existing few-shot learning methods often perform worse than simple baseline models on drug discovery tasks, as they tend to ignore background information (such as similar molecules and similar activities).
   - To address this, the paper proposes a method that enriches molecule representations by associating both the query set and the support set with a large set of context (reference) molecules, thereby improving the quality of the predictive model.

### Main Contributions

1. **Proposing a New Architecture, MHNfs**:
   - It uses Modern Hopfield Networks (MHN) to enrich molecule representations and achieves state-of-the-art results on the FS-Mol benchmark dataset.
2. **Introducing the Concept of Context Enrichment**:
   - Molecule representations are enriched by associating them with a large set of context molecules, which improves the model's generalization ability.
3. **Adding a Simple Baseline**:
   - A simple baseline model is added to the FS-Mol benchmark; it outperforms most published few-shot learning methods.
4. **Experimental Validation**:
   - An ablation study and a domain shift experiment further demonstrate the effectiveness and robustness of the method.
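
The context-enrichment idea can be illustrated with a single Hopfield-style retrieval step, which amounts to softmax attention over a set of stored patterns. The following is a minimal sketch under stated assumptions, not the MHNfs implementation: the function name `enrich_with_context`, the inverse temperature `beta`, and the residual combination are hypothetical choices made for illustration, and the learned projections and normalization a trained model would use are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def enrich_with_context(molecules, context, beta=1.0):
    """One Hopfield-style retrieval step (illustrative, not the paper's code).

    molecules: (n, d) embeddings of support-set or query-set molecules.
    context:   (c, d) embeddings of the reference (context) molecules.
    beta:      inverse temperature; larger values give sharper retrieval.

    Returns the enriched embeddings (n, d) and the association weights (n, c).
    """
    scores = beta * molecules @ context.T   # similarity to every context molecule
    weights = softmax(scores, axis=-1)      # each row sums to 1
    retrieved = weights @ context           # weighted average of context patterns
    # Assumption: the retrieved pattern is combined with the input via a residual sum.
    return molecules + retrieved, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    support = rng.normal(size=(5, 8))      # 5 molecules, 8-dim embeddings
    context_set = rng.normal(size=(20, 8))  # 20 context molecules
    enriched, assoc = enrich_with_context(support, context_set, beta=0.5)
    print(enriched.shape, assoc.shape)
```

In this sketch the same function would be applied to both the support set and the query set, so that both are expressed relative to the same context molecules before any similarity-based prediction is made.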