Limeade: Let integer molecular encoding aid

Shiqiang Zhang, Christian W. Feldmann, Frederik Sandfort, Miriam Mathea, Juan S. Campos, Ruth Misener
2024-11-26
Abstract: Mixed-integer programming (MIP) is a well-established framework for computer-aided molecular design (CAMD). By precisely encoding the molecular space and score functions, e.g., a graph neural network, the molecular design problem is represented and solved as an optimization problem, the solution of which corresponds to a molecule with optimal score. However, both the extremely large search space and the complicated scoring process limit the use of MIP-based CAMD to specific and tiny problems. Moreover, an optimal molecule may not be meaningful in practice if scores are imperfect. Instead of pursuing optimality, this paper exploits the ability of MIP in molecular generation and proposes Limeade as an end-to-end tool from real-world needs to feasible molecules. Beyond the basic constraints for structural feasibility, Limeade supports inclusion and exclusion of SMARTS patterns, automating the process of interpreting and formulating chemical requirements as mathematical constraints.
Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
This paper addresses several key problems in computer-aided molecular design (CAMD):

1. **Large search space and complex scoring process**: CAMD methods based on mixed-integer programming (MIP) must handle extremely large molecular spaces and complicated score functions, which restricts them to specific, small problems. Moreover, even when an optimal solution is found, the resulting molecule may not be meaningful in practice if the score function is imperfect.
2. **Molecular feasibility and practicality**: Generated molecules must be feasible not only mathematically (i.e., satisfy the constraints of the optimization model) but also practically, for example in terms of synthesizability and stability.
3. **Combining knowledge-driven and data-driven methods**: Accumulated chemical knowledge (such as relationships between structures and properties) needs to be combined with machine-learning models (such as graph neural networks) to improve the efficiency and accuracy of molecular design.

To address these problems, the authors propose a framework named Limeade. Its main features are:

- **End-to-end generation tool**: Limeade is a lightweight end-to-end tool that quickly generates feasible molecules from the user's requirements, without requiring the user to have optimization expertise.
- **Automatic handling of chemical requirements**: Limeade automatically converts chemical requirements into mathematical constraints, generates feasible solutions with a MIP solver, and decodes the solutions into molecular structures.
- **Substructure inclusion and exclusion**: Users can specify substructures to include or exclude as SMARTS patterns, giving finer control over the characteristics of the generated molecules.
- **Verification step**: To help ensure that the generated molecules are practically feasible, Limeade performs an additional verification step after generation, removing symmetric (duplicate) molecules and checking for the presence of undesired substructures.

In summary, Limeade aims to bridge the gap between chemical engineering and mathematical programming by improving the applicability and compatibility of the MIP framework, so that CAMD methods can be applied more widely to practical problems.
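
The verification step described above can be illustrated with a minimal, standard-library-only sketch. It is not Limeade's implementation, which the summary does not show. Everything here is an assumption made for illustration: the molecule representation (a tuple of element symbols plus `(i, j, order)` bonds), the helper names `canonical_key`, `has_forbidden_bond`, and `verify`, the brute-force canonical labeling used to detect symmetric duplicates, and the toy element-pair check that stands in for real SMARTS matching.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Return a labeling-independent key by brute force over atom orderings.

    atoms: sequence of element symbols, e.g. ("C", "C", "O")
    bonds: iterable of (i, j, order) with 0-based atom indices
    Hypothetical helper; only practical for very small graphs (n! orderings).
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] gives the new index of that atom
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in bonds
        )
        key = (tuple(new_atoms), tuple(new_bonds))
        if best is None or key < best:
            best = key
    return best


def has_forbidden_bond(atoms, bonds, forbidden):
    """True if any bond joins an element pair listed in `forbidden`.

    Toy stand-in (assumption) for matching an excluded SMARTS pattern.
    """
    return any(
        tuple(sorted((atoms[i], atoms[j]))) in forbidden for i, j, _ in bonds
    )


def verify(candidates, forbidden):
    """Drop candidates with forbidden bonds, then drop symmetric duplicates."""
    seen = set()
    kept = []
    for atoms, bonds in candidates:
        if has_forbidden_bond(atoms, bonds, forbidden):
            continue
        key = canonical_key(atoms, bonds)
        if key in seen:
            continue  # same molecule under a different atom numbering
        seen.add(key)
        kept.append((atoms, bonds))
    return kept


# Heavy-atom skeletons only; hydrogens are left implicit in this sketch.
candidates = [
    (("C", "C", "O"), [(0, 1, 1), (1, 2, 1)]),  # C-C-O chain
    (("O", "C", "C"), [(0, 1, 1), (1, 2, 1)]),  # same chain, renumbered
    (("C", "O", "C"), [(0, 1, 1), (1, 2, 1)]),  # C-O-C, a different molecule
    (("C", "O", "O"), [(0, 1, 1), (1, 2, 1)]),  # contains an O-O bond
]
kept = verify(candidates, forbidden={("O", "O")})
print(len(kept), "molecules kept")
```

In practice, a cheminformatics toolkit would likely replace both helpers: canonical SMILES for duplicate detection and genuine SMARTS substructure matching for the exclusion check. The brute-force version only makes the two checks explicit on a tiny example.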