Abstract: Generative deep learning is reshaping drug design. Chemical language models (CLMs), which generate molecules in the form of molecular strings, bear particular promise for this endeavor. Here, we introduce a recent deep learning architecture, termed Structured State-Space Sequence (S4) model, into de novo drug design. In addition to its unprecedented performance in various fields, S4 has a remarkable capability to capture the global properties of long sequences. This aspect is key for chemical language modeling, where complex molecular properties like bioactivity can 'emerge' from distant positions in the molecular strings. This observation gives rise to the following question: Can S4 advance chemical language modeling for de novo design? To provide an answer, we systematically benchmark S4 with state-of-the-art CLMs on an array of drug discovery tasks, such as the identification of bioactive compounds, and the design of drug-like molecules and natural products. S4 showed a superior capacity to learn complex molecular properties, while at the same time exploring diverse scaffolds. Finally, when applied prospectively to kinase inhibition, S4 designed eight out of ten molecules that were predicted as highly active by molecular dynamics simulations. Taken together, these findings advocate for the introduction of S4 into chemical language modeling, uncovering its untapped potential in the molecular sciences.

What problem does this paper attempt to address?

The main objective of this paper is to explore the potential of the Structured State-Space Sequence (S4) model in chemical language modeling (CLM), particularly for the de novo design of drug molecules. S4 is a recently proposed deep learning architecture that handles long sequences well and captures their global properties. This matters for chemical language modeling because complex properties, such as a molecule's biological activity, may be jointly determined by distant parts of the molecular string.

The paper first introduces the background: the concept of chemical language models, how they represent molecular structures as Simplified Molecular Input Line Entry System (SMILES) strings, and an overview of the two most commonly used architectures, Long Short-Term Memory (LSTM) networks and Transformers. It then details the working principles and advantages of S4: like a Transformer, it can process an entire input sequence at once to learn complex global features, and like an LSTM, it can generate sequences element by element, combining the advantages of both.

The core contributions of the paper are as follows:

1. **Performance Evaluation**: Through a series of benchmarks, the paper evaluates S4 against LSTM and Transformer models on drug design tasks, including identifying bioactive compounds and designing drug-like molecules and natural products.
2. **Case Study**: The paper applies S4 prospectively to the design of kinase inhibitors and assesses the potential biological activity of the designs with molecular dynamics simulations, further demonstrating the model's effectiveness and practicality.
3. **Conclusion**: The experimental results show that S4 learns to generate chemically valid molecules while exploring diverse molecular scaffolds, and that it is especially strong at capturing biological activity. In the kinase inhibitor case study, eight out of ten designed molecules were predicted to be highly active by molecular dynamics simulations.

In summary, the study demonstrates the potential of S4 as a tool for chemical language modeling, particularly in the de novo design of drug molecules, and supports its use in developing new drug candidates.
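The claim that a state-space model can both process a whole sequence at once and generate element by element can be illustrated with a toy linear state-space model. The sketch below is not the S4 implementation from the paper: the function names, the random matrices, and the scalar input sequence are all assumptions made for illustration, and S4's structured parameterization of the state matrix and its efficient kernel computation are not reproduced. It only shows that the same model, with state update x_k = A x_{k-1} + B u_k and readout y_k = C x_k, yields identical outputs when run as a step-by-step recurrence and as a single causal convolution with kernel K_j = C A^j B.

```python
import numpy as np


def ssm_recurrent(A, B, C, u):
    """Run the model one element at a time (the generation-style view)."""
    x = np.zeros(A.shape[0])
    outputs = []
    for u_k in u:
        x = A @ x + B * u_k        # update the hidden state with the new input
        outputs.append(C @ x)      # read out a scalar output
    return np.array(outputs)


def ssm_kernel(A, B, C, length):
    """Build the convolution kernel K[j] = C A^j B for j = 0..length-1."""
    kernel = []
    v = B.copy()
    for _ in range(length):
        kernel.append(C @ v)
        v = A @ v                  # advance to the next power of A applied to B
    return np.array(kernel)


def ssm_convolutional(A, B, C, u):
    """Process the whole sequence at once (the training-style view)."""
    K = ssm_kernel(A, B, C, len(u))
    # Full convolution truncated to the input length gives the causal output.
    return np.convolve(u, K)[: len(u)]


# Hypothetical toy parameters: state size 4, sequence length 32.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # rescale so the recurrence is stable
B = rng.normal(size=4)
C = rng.normal(size=4)
u = rng.normal(size=32)

y_step = ssm_recurrent(A, B, C, u)
y_conv = ssm_convolutional(A, B, C, u)
print(np.allclose(y_step, y_conv))
```

In a chemical language model the inputs would be learned token embeddings of a SMILES string rather than random scalars, and many such state-space layers would be stacked with nonlinearities; the equivalence of the two views is the part this sketch is meant to convey.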

Chemical Language Modeling with Structured State Spaces

Chemical language modeling with structured state space sequence models
