Surrogate-Based Black-Box Optimization Method for Costly Molecular Properties

Jules Leguy, Thomas Cauchy, Beatrice Duval, Benoit Da Mota
DOI: https://doi.org/10.48550/arXiv.2110.03522
2021-10-01
Abstract: AI-assisted molecular optimization is a very active research field as it is expected to provide the next-generation drugs and molecular materials. An important difficulty is that the properties to be optimized rely on costly evaluations. Machine learning methods are investigated with success to predict these properties, but show generalization issues on less known areas of the chemical space. We propose here a surrogate-based black box optimization method, to tackle jointly the optimization and machine learning problems. It consists in optimizing the expected improvement of the surrogate of a molecular property using an evolutionary algorithm. The surrogate is defined as a Gaussian Process Regression (GPR) model, learned on a relevant area of the search space with respect to the property to be optimized. We show that our approach can successfully optimize a costly property of interest much faster than a purely metaheuristic approach.
Machine Learning, Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the high computational cost of molecular property optimization. Optimizing molecular properties (such as electronic properties) usually depends on quantum mechanics (QM) calculations, which are time-consuming and resource-intensive. To meet this challenge, the authors propose a surrogate-based black-box optimization method that optimizes these expensive properties more efficiently.

#### Main problems:

1. **Expensive quantum mechanics calculations**: Accurately estimating molecular properties requires expensive quantum mechanics calculations, which limit large-scale exploration and optimization.
2. **Generalization problems of machine-learning models**: Machine-learning methods can predict molecular properties, but they generalize poorly to less-explored regions of chemical space.
3. **Limitations of data sets**: Existing quantum chemistry data sets (such as QM9) lack chemical diversity, and samples with near-optimal property values are rare in the training data.

#### Solutions:

- **Surrogate-based black-box optimization**: A Gaussian process regression (GPR) model serves as the surrogate, and an evolutionary algorithm optimizes its expected improvement (EI) over molecular graphs, which reduces the number of expensive quantum mechanics calculations needed.
- **Adaptive data selection strategy**: The training data set is built from samples in the region of the search space that is relevant to the target property, further reducing the computational cost.
- **Suitable for small amounts of data**: The method can optimize molecular properties effectively from very little initial data. For example, the initial data set used in the experiment contains only a single simple molecule (methane).
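The expected improvement criterion mentioned under Solutions can be illustrated with a minimal sketch. It assumes a maximization problem and a surrogate that returns a Gaussian prediction (mean and standard deviation) for each candidate; the function names, candidate labels, and numeric values are hypothetical and are not taken from the paper. The sketch uses the standard closed-form EI and only the Python standard library.

```python
import math


def normal_pdf(z: float) -> float:
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Closed-form EI for maximization, given a Gaussian prediction.

    mean, std: surrogate prediction for one candidate (assumed inputs).
    best: best property value observed so far.
    """
    if std <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def pick_next(predictions: dict, best: float) -> str:
    """Return the candidate label with the highest expected improvement."""
    return max(
        predictions,
        key=lambda name: expected_improvement(*predictions[name], best),
    )


# Hypothetical surrogate outputs: label -> (predicted mean, predicted std).
predictions = {
    "candidate_a": (1.0, 0.1),  # close to the best value, low uncertainty
    "candidate_b": (0.9, 0.8),  # lower mean, high uncertainty
    "candidate_c": (1.2, 0.0),  # slightly better, no uncertainty
}

chosen = pick_next(predictions, best=1.1)
print(chosen)
```

In this toy setting the high-uncertainty candidate is selected even though its predicted mean is below the current best, which shows how EI trades off exploitation against exploration. In the method described above, an evolutionary algorithm would search for molecules that maximize this score instead of enumerating a fixed list.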
This method not only improves optimization efficiency but can also provide reasonable approximate solutions when data is scarce, offering new ideas and tools for drug design and the development of new materials.

### Summary

This paper tackles the high computational cost of molecular property optimization by introducing a surrogate-based black-box optimization framework. The method combines Gaussian process regression with an evolutionary algorithm and can optimize molecular properties efficiently with less data and fewer computational resources.