Gotta be SAFE: A New Framework for Molecular Design

Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan S.C Lim, Prudencio Tossou
2023-12-11
Abstract: Traditional molecular string representations, such as SMILES, often pose challenges for AI-driven molecular design due to their non-sequential depiction of molecular substructures. To address this issue, we introduce Sequential Attachment-based Fragment Embedding (SAFE), a novel line notation for chemical structures. SAFE reimagines SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. It streamlines complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, while facilitating autoregressive generation for fragment-constrained design, thereby eliminating the need for intricate decoding or graph-based models. We demonstrate the effectiveness of SAFE by training an 87-million-parameter GPT2-like model on a dataset containing 1.1 billion SAFE representations. Through targeted experimentation, we show that our SAFE-GPT model exhibits versatile and robust optimization performance. SAFE opens up new avenues for the rapid exploration of chemical space under various constraints, promising breakthroughs in AI-driven molecular design.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
The paper proposes Sequential Attachment-based Fragment Embedding (SAFE), a new line notation for chemical structures, to address a key problem in molecular design. Traditional molecular string representations such as SMILES do not depict molecular substructures sequentially: the atoms of a single fragment can end up scattered across the string, which hinders AI-driven molecular design. SAFE addresses this by reimagining a SMILES string as an unordered sequence of interconnected fragment blocks while remaining compatible with existing SMILES parsers. This simplifies complex generative tasks, including scaffold decoration, fragment linking, polymer generation, and scaffold hopping, and it supports autoregressive generation under fragment constraints, eliminating the need for intricate decoding schemes or graph-based models. The paper demonstrates SAFE's effectiveness by training an 87-million-parameter GPT2-like model on a dataset of 1.1 billion SAFE representations. Experimental results show that the resulting SAFE-GPT model exhibits versatile and robust optimization performance across molecular generation tasks. The paper also proposes a new benchmark for evaluating purely generative models on drug discovery challenges such as scaffold decoration, linker design, and scaffold expansion. Compared with SMILES, SELFIES, and other linear molecular representations, SAFE better preserves scaffold and fragment integrity, making it particularly suitable for fragment-based molecular design. In conclusion, SAFE opens new avenues for AI-driven molecular design by enabling rapid exploration of chemical space under various constraints.
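To make the idea of "interconnected fragment blocks" concrete, the following is a minimal sketch in plain Python. It assumes that a SAFE-style string separates fragment blocks with "." and marks each attachment with a ring-closure label that appears in two different blocks, which is a pattern ordinary SMILES grammar already permits. The example string, the helper names `ring_labels` and `attachment_map`, and the simplified tokenizer are illustrative assumptions, not the authors' implementation or the API of any library.

```python
import re
from collections import Counter

# Simplified tokenizer (an assumption for this sketch): bracket atoms such as
# [13C] or [NH3+] are matched whole so the digits inside them are not
# mistaken for ring-closure labels; %nn and single digits are ring labels.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d{2}|\d")


def ring_labels(fragment):
    """Count how often each ring-closure label occurs in one fragment block."""
    counts = Counter()
    for tok in _TOKEN.findall(fragment):
        if tok.startswith("["):
            continue  # bracket atom, not a ring label
        counts[int(tok.lstrip("%"))] += 1
    return counts


def attachment_map(safe_like):
    """Split a SAFE-style string into blocks and list inter-block attachments.

    A label seen an odd number of times inside a block is treated as "open",
    i.e. it must be closed in another block. Returns (fragments, links), where
    each link is ((block_i, block_j), label).
    """
    fragments = safe_like.split(".")
    open_labels = {}
    for i, frag in enumerate(fragments):
        for label, n in ring_labels(frag).items():
            if n % 2 == 1:
                open_labels.setdefault(label, []).append(i)
    links = sorted(
        (tuple(blocks), label)
        for label, blocks in open_labels.items()
        if len(blocks) == 2
    )
    return fragments, links


# Hand-written illustration: phenethylamine (c1ccccc1CCN in plain SMILES)
# written as a benzene block and a CCN block joined through label 2.
fragments, links = attachment_map("c12ccccc1.C2CN")
print(fragments)
print(links)
```

Because the attachments live in the labels rather than in the order of the blocks, reordering the blocks (for example "C2CN.c12ccccc1") describes the same connectivity, which is what lets a fragment be fixed as a prefix and the remainder generated autoregressively. The sketch ignores genuinely disconnected components such as salts, where "." has its usual SMILES meaning.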