Deep learning for low-data drug discovery: hurdles and opportunities

Derek van Tilborg, Helena Brinkmann, Emanuele Criscuolo, Luke Rossen, Rıza Özçelik, Francesca Grisoni
DOI: https://doi.org/10.26434/chemrxiv-2024-w0wvl
2024-01-25
Abstract: Deep learning is becoming increasingly relevant in drug discovery, from de novo design to protein structure prediction and synthesis planning. However, it is often challenged by the small data regimes typical of certain drug discovery tasks. In such scenarios, deep learning approaches – which are notoriously ‘data-hungry’ – might fail to live up to their promise. Developing novel approaches to leverage the power of deep learning in low-data scenarios is sparking great attention, and future developments are expected to propel the field further. This minireview provides an overview of recent low-data-learning approaches in drug discovery, analyzing their hurdles and advantages. Finally, we venture to provide a forecast of future research directions in low-data learning for drug discovery.
Chemistry
What problem does this paper attempt to address?
This paper addresses the small-data problem in deep learning for drug discovery. Deep learning has performed well in tasks such as protein structure prediction and synthesis planning, but it usually needs large amounts of data to do so. Drug discovery datasets are typically small, containing hundreds of molecules with limited structural diversity, which limits what deep learning can achieve.

The paper reviews three groups of strategies for the low-data setting:
1. Data augmentation: increasing the number of training samples, for example by generating different representations (such as alternative SMILES strings) of the same molecule.
2. Multi-stage training: pre-training a model on large datasets and then fine-tuning it on the specific task, so that knowledge transfers from the former to the latter.
3. Context enrichment: enhancing the model by providing additional input information or auxiliary prediction tasks, as in multi-modal learning and multi-task learning.

The paper also discusses the advantages and limitations of these methods. Data augmentation can improve model performance, but the gains become minimal when augmentation is excessive. Multi-stage training benefits from transfer learning but may introduce pre-training bias. Context-enriched training can improve performance by combining different input types or tasks, but integrating that information effectively is itself a challenge.

Future research directions covered in the paper include improving model generalization to new molecules, causal and interpretable deep learning, geometric deep learning, and structure-guided drug discovery. Finally, the authors emphasize the need to evaluate and select low-data deep learning strategies carefully, and call for metrics and datasets designed specifically for low-data training.
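The multi-stage training idea mentioned above can be illustrated with a deliberately tiny sketch. It is not taken from the paper: it uses a one-dimensional linear model trained by gradient descent rather than a deep network, and the data, the function names (`make_data`, `train`, `mse`, `run_demo`), and all parameter values are hypothetical choices made for illustration. The point is only to show the mechanics: pre-train on a large related task, then fine-tune on a handful of target examples, and compare against training from scratch with the same small budget.

```python
import random


def make_data(n, slope, intercept, noise, rng):
    """Sample n points from y = slope * x + intercept + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def train(xs, ys, w=0.0, b=0.0, lr=0.1, epochs=5):
    """Full-batch gradient descent on mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def run_demo(seed=0):
    rng = random.Random(seed)
    # Large, related "source" task (hypothetical stand-in for a big pre-training set).
    src_x, src_y = make_data(1000, slope=2.0, intercept=1.0, noise=0.05, rng=rng)
    # Small "target" task: only 8 labelled points, slightly different relationship.
    tgt_x, tgt_y = make_data(8, slope=2.2, intercept=0.9, noise=0.05, rng=rng)
    # Held-out target data used only for evaluation.
    test_x, test_y = make_data(200, slope=2.2, intercept=0.9, noise=0.05, rng=rng)

    # Stage 1: pre-train on the source task.
    w_pre, b_pre = train(src_x, src_y, epochs=200)
    # Stage 2: fine-tune briefly on the target task from the pre-trained weights.
    w_ft, b_ft = train(tgt_x, tgt_y, w=w_pre, b=b_pre, epochs=5)
    # Baseline: the same short budget on the target task, starting from zero.
    w_sc, b_sc = train(tgt_x, tgt_y, epochs=5)

    return {
        "fine_tuned": mse(test_x, test_y, w_ft, b_ft),
        "from_scratch": mse(test_x, test_y, w_sc, b_sc),
    }


results = run_demo()
print(results)
```

In this toy setup the fine-tuned model should reach a lower held-out error than the from-scratch baseline, because the source and target tasks were chosen to be similar. That similarity is an assumption of the sketch; when the pre-training data differ strongly from the target task, the starting point can mislead instead, which corresponds to the pre-training bias noted above.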