Chemical Language Modeling with Structured State Spaces

Rıza Özçelik, Sarah de Ruiter, Francesca Grisoni, Emanuele Criscuolo
DOI: https://doi.org/10.26434/chemrxiv-2023-jwmf3-v2
2024-01-25
Abstract: Generative deep learning is reshaping drug design. Chemical language models (CLMs), which generate molecules in the form of molecular strings, bear particular promise for this endeavor. Here, we introduce a recent deep learning architecture, termed Structured State-Space Sequence (S4) model, into de novo drug design. In addition to its unprecedented performance in various fields, S4 has a remarkable capability to capture the global properties of long sequences. This aspect is key for chemical language modeling, where complex molecular properties like bioactivity can 'emerge' from distant positions in the molecular strings. This observation gives rise to the following question: Can S4 advance chemical language modeling for de novo design? To provide an answer, we systematically benchmark S4 against state-of-the-art CLMs on an array of drug discovery tasks, such as the identification of bioactive compounds and the design of drug-like molecules and natural products. S4 showed a superior capacity to learn complex molecular properties, while at the same time exploring diverse scaffolds. Finally, when applied prospectively to kinase inhibition, S4 designed eight out of ten molecules that were predicted as highly active by molecular dynamics simulations. Taken together, these findings advocate for the introduction of S4 into chemical language modeling and point to its untapped potential in the molecular sciences.
Chemistry
What problem does this paper attempt to address?
The main objective of this paper is to explore the potential of the Structured State-Space Sequence (S4) model in chemical language modeling (CLM), particularly for the de novo design of drug molecules. S4 is a recently proposed deep learning architecture that handles long sequences well and captures their global properties. This matters for chemical language modeling because complex properties, such as the biological activity of a molecule, may be jointly determined by distant parts of the molecular string.

The paper first introduces the background: the concept of chemical language models, how they represent molecular structures as Simplified Molecular Input Line Entry System (SMILES) strings, and an overview of the two most commonly used architectures, Long Short-Term Memory (LSTM) networks and Transformers. It then details the working principles and advantages of S4: it can process an entire input sequence at once, like a Transformer, to learn complex global features, and it can generate sequences element by element, like an LSTM, combining the advantages of both.

The core contributions of the paper are as follows:

1. **Performance Evaluation**: Through a series of benchmarks, the paper evaluates S4 against LSTM and Transformer models on drug design tasks, including identifying bioactive compounds and designing drug-like molecules and natural products.
2. **Case Study**: The paper applies S4 prospectively to design kinase inhibitors and assesses their potential biological activity with molecular dynamics simulations, further demonstrating the model's practicality.
3. **Conclusion**: The experiments show that S4 learns to generate chemically valid molecules while exploring diverse molecular scaffolds, and that it is especially good at capturing bioactivity. In the kinase inhibitor case study, eight out of ten designed molecules were predicted to be highly active by molecular dynamics simulations.

In summary, this study demonstrates the potential of S4 as a tool for chemical language modeling, particularly for the de novo design of drug molecules and the development of new drug candidates.
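
The claim that S4 can process a sequence globally yet also generate it element by element rests on a general property of linear state-space models: the same parameters define both a step-by-step recurrence and a single convolution over the whole input. The following is a minimal sketch of that equivalence, not the S4 implementation from the paper. It assumes a toy discrete model with a diagonal state matrix, and the function names and parameter values are hypothetical, chosen only for illustration. The structured parameterization and discretization that S4 actually uses are omitted.

```python
# Toy discrete linear state-space model with a diagonal state matrix.
# All names and parameter values are illustrative assumptions, not taken
# from the paper or from any S4 library.

def ssm_recurrent(a, b, c, u):
    """Element-by-element view: x_k = a * x_{k-1} + b * u_k, y_k = sum(c * x_k)."""
    state = [0.0] * len(a)
    outputs = []
    for u_k in u:
        # Update each state dimension independently (diagonal state matrix).
        state = [a_i * x_i + b_i * u_k for a_i, b_i, x_i in zip(a, b, state)]
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


def ssm_kernel(a, b, c, length):
    """Convolution kernel K_j = sum_i c_i * a_i**j * b_i for j = 0..length-1."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def ssm_convolutional(a, b, c, u):
    """Global view: y_k = sum_{j=0..k} K_j * u_{k-j}, computed from the full input."""
    kernel = ssm_kernel(a, b, c, len(u))
    return [
        sum(kernel[j] * u[k - j] for j in range(k + 1))
        for k in range(len(u))
    ]


# Arbitrary example parameters and input sequence.
a = [0.9, 0.5, -0.3]   # diagonal of the state matrix
b = [1.0, 0.5, 2.0]    # input projection
c = [0.2, -1.0, 0.7]   # output projection
u = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(a, b, c, u)
y_conv = ssm_convolutional(a, b, c, u)

# Both views should agree up to floating-point error.
max_diff = max(abs(p - q) for p, q in zip(y_rec, y_conv))
print(max_diff < 1e-9)
```

In this sketch the convolutional form sees the full input at once, which is the sense in which training can be "global", while the recurrent form needs only the previous state to emit the next output, which suits generating a molecular string one token at a time.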