Smart Distributed Data Factory: Volunteer Computing Platform for Active Learning-Driven Molecular Data Acquisition

Tsolak Ghukasyan, Vahagn Altunyan, Aram Bughdaryan, Tigran Aghajanyan, Khachik Smbatyan, Garegin A. Papoian, Garik Petrosyan
DOI: https://doi.org/10.1101/2024.10.22.619651
2024-11-11
Abstract: This paper presents the Smart Distributed Data Factory (SDDF), an AI-driven distributed computing platform designed to address challenges in drug discovery by creating comprehensive datasets of molecular conformations and their properties. SDDF uses volunteer computing, leveraging the processing power of personal computers worldwide to accelerate density functional theory (DFT) quantum chemistry calculations. To cope with the vastness of chemical space and the scarcity of high-quality data, SDDF employs an ensemble of machine learning models to predict molecular properties and to select the most challenging data points for further DFT calculations. The platform also generates new molecular conformations using molecular dynamics, with forces derived from these models. SDDF makes four contributions: a volunteer computing platform for DFT calculations; an active learning framework for constructing a dataset of molecular conformations; a large public dataset of diverse ENAMINE molecules with calculated energies; and an ensemble of state-of-the-art ML models for accurate energy prediction. The energy dataset was generated to validate the SDDF approach to reducing the need for extensive calculations. With its strict scaffold split, the dataset can be used for training and benchmarking energy models. By combining active learning, distributed computing, and quantum chemistry, SDDF offers a scalable, cost-effective solution for developing accurate molecular models and ultimately accelerating drug discovery.
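The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how "most challenging" is measured, so this example assumes ensemble disagreement (the standard deviation of the per-model energy predictions) as the criterion, a common choice in active learning. The function name `select_for_dft`, the conformation ids, and the energy values are all hypothetical and are not taken from the paper.

```python
from statistics import pstdev


def select_for_dft(predictions, budget):
    """Pick the conformations on which the ensemble disagrees most.

    predictions: dict mapping a conformation id to the list of energies
                 predicted by each ensemble member (hypothetical format).
    budget:      number of conformations to send for DFT calculation.
    """
    # Assumption: disagreement = population standard deviation of the
    # ensemble's predictions; the paper may use a different measure.
    disagreement = {
        cid: pstdev(energies) for cid, energies in predictions.items()
    }
    # Rank conformations from most to least uncertain.
    ranked = sorted(disagreement, key=lambda cid: disagreement[cid], reverse=True)
    return ranked[:budget]


# Made-up predictions from a three-model ensemble (illustrative values only).
ensemble_predictions = {
    "conf_a": [-1.00, -1.01, -0.99],
    "conf_b": [-2.00, -1.50, -2.60],
    "conf_c": [-0.50, -0.52, -0.47],
}

selected = select_for_dft(ensemble_predictions, budget=1)
print(selected)
```

Here the models disagree most on `conf_b`, so it would be the first conformation queued for a DFT calculation; conformations on which the ensemble already agrees are skipped, which is how such a scheme could reduce the number of expensive calculations.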
Biophysics