fragSMILES: a Chemical String Notation for Advanced Fragment and Chirality Representation

Fabrizio Mastrolorito, Fulvio Ciriaco, Maria Vittoria Togo, Nicola Gambacorta, Daniela Trisciuzzi, Cosimo Damiano Altomare, Nicola Amoroso, Francesca Grisoni, Orazio Nicolotti
DOI: https://doi.org/10.26434/chemrxiv-2024-tm7n6
2024-07-15
Abstract: Generative models have revolutionized de novo drug design, making it possible to produce molecules on demand with desired physicochemical and pharmacological properties. String-based molecular representations, such as SMILES (Simplified Molecular Input Line Entry System) and SELFIES (Self-Referencing Embedded Strings), have played a pivotal role in the success of generative approaches, thanks to their capacity to encode atom and bond information and their ease of generation. However, such ‘atom-level’ string representations have limitations in capturing information on chirality and on the synthetic accessibility of the corresponding designs. In this paper, we present fragSMILES, a novel fragment-based molecular representation in string form. fragSMILES encodes fragments in a ‘chemically meaningful’ way via a novel graph-reduction approach, yielding an efficient, interpretable, and expressive molecular representation. fragSMILES advances the state of the art of fragment-based representations by reporting fragments and their ‘breaking’ bonds independently, without fragment redundancy. Moreover, fragSMILES also embeds information on molecular chirality, thereby overcoming known limitations of existing string notations. When compared with SMILES and SELFIES for de novo design, the fragSMILES notation showed promise in generating molecules with desirable biochemical and scaffold properties.
Chemistry
What problem does this paper attempt to address?
The paper proposes fragSMILES, a fragment-based chemical string notation designed to address the limitations of existing string representations such as SMILES and SELFIES in handling chirality and synthetic accessibility. Existing notations can introduce fragment redundancy and encode chiral centers ambiguously. fragSMILES instead decomposes a molecule into chemically meaningful fragments through a graph-reduction approach. Each fragment receives a unique code that does not depend on its surrounding molecular environment, and fragments are reported independently of the bonds that were broken to obtain them. The resulting strings are shorter and better suited to chemical language modeling, while preserving key chemical information. In particular, fragSMILES represents chirality explicitly, which existing string notations do not handle well.
The paper compares fragSMILES with SMILES and SELFIES in de novo drug design, where it generates molecules with desirable biochemical and scaffold properties. The reported experiments indicate that designing molecules with fragSMILES can improve novelty, uniqueness, and synthetic feasibility while keeping physicochemical properties similar to those of the training-set molecules. The study also examines the effect of data augmentation on the results, as well as the ability of fragSMILES to capture chirality and to explore new molecular scaffolds. Although other notations may generate a larger number of entirely new scaffolds, fragSMILES excels at creating novel core structures by recombining existing cyclic elements, which helps ensure the chemical stability and synthetic feasibility of the generated molecules.
In summary, fragSMILES provides a more efficient and expressive tool for molecular representation and drug discovery, and is expected to promote advancements in the field of chemistry and pharmaceutical development.
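The novelty and uniqueness figures mentioned above are usually computed as simple set ratios over the generated strings. The sketch below illustrates one common convention; it is an assumption, not necessarily the exact definition used in the paper. The function name `generation_metrics` and the placeholder validity check are hypothetical, and the strings are assumed to be already canonicalized so that identical molecules compare equal.

```python
def generation_metrics(generated, training, is_valid=bool):
    """Compute validity, uniqueness, and novelty for generated molecule strings.

    Assumed (commonly used) definitions, which may differ from the paper's:
      validity   = valid strings / all generated strings
      uniqueness = distinct valid strings / valid strings
      novelty    = distinct valid strings absent from training / distinct valid strings
    `is_valid` is a placeholder; a real pipeline would parse each string
    with a cheminformatics toolkit instead of testing for non-emptiness.
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy example: five generated strings, one of them empty (treated as invalid).
generated = ["CCO", "CCO", "c1ccccc1", "", "CC(=O)O"]
training = ["CCO"]
metrics = generation_metrics(generated, training)
print(metrics)
```

In this toy run, four of the five strings count as valid, three of those four are distinct, and two of the three distinct strings are absent from the training set.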