SELF-BART : A Transformer-based Molecular Representation Model using SELFIES

Indra Priyadarsini, Seiji Takeda, Lisa Hamada, Emilio Vital Brazil, Eduardo Soares, Hajime Shinohara
2024-10-16
Abstract: Large-scale molecular representation methods have revolutionized applications in materials science, such as drug discovery, chemical modeling, and material design. With the rise of transformers, models now learn representations directly from molecular structures. In this study, we develop an encoder-decoder model based on BART that is capable of learning molecular representations and generating new molecules. Trained on SELFIES, a robust molecular string representation, our model outperforms existing baselines in downstream tasks, demonstrating its potential for efficient and effective molecular data analysis and manipulation.
Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
This paper addresses how to learn large-scale molecular representations for materials science applications such as drug discovery, chemical modeling, and materials design. With the rise of the Transformer, models can now learn representations directly from molecular structures. However, most existing Transformer models are encoder-only models trained on SMILES (Simplified Molecular Input Line Entry System) strings. Because SMILES does not guarantee that a string is syntactically and semantically valid, such models may learn from, or produce, representations that do not correspond to valid molecules.

To address this, the paper proposes SELF-BART, an encoder-decoder model based on BART (Bidirectional and Auto-Regressive Transformers). The model uses SELFIES (SELF-referencIng Embedded Strings) as its molecular string representation to overcome the shortcomings of SMILES. According to the paper, SELFIES provides a more concise and interpretable molecular representation, and it ensures that generated strings correspond to valid molecules. As a result, SELF-BART can both learn molecular representations and generate new molecules autoregressively, which matters for molecular design.

The main contributions of the paper are:

1. **Model innovation**: An encoder-decoder model based on BART that learns complex molecular relationships and interactions from molecular structures and can generate new molecules.
2. **Data representation**: SELFIES is used instead of SMILES as the string representation of molecules, so that generated representations are valid.
3. **Performance improvement**: Experimental results on multiple benchmark datasets show that SELF-BART outperforms existing baseline models in both molecular property prediction and molecular generation tasks.
Together, these contributions show the potential of SELF-BART for analyzing and manipulating molecular data, particularly for designing and generating novel molecules.
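To make the idea of feeding SELFIES strings to a sequence model more concrete, the following is a minimal sketch of how such strings could be split into bracketed symbols and mapped to integer ids. It is an illustration only: this summary does not describe the tokenizer, vocabulary, or special tokens that SELF-BART actually uses, so the helper names (`tokenize_selfies`, `build_vocab`, `encode`), the special tokens, and the two example strings (commonly cited SELFIES for ethanol and benzene) are all assumptions.

```python
import re
from typing import Dict, Iterable, List

# A SELFIES string is a sequence of bracketed symbols such as [C] or [=C].
SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]")

# Assumed special tokens; the paper's actual choices are not given here.
SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<mask>"]


def tokenize_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    tokens = SELFIES_SYMBOL.findall(selfies)
    # Reject input that contains characters outside bracketed symbols.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(corpus: Iterable[str]) -> Dict[str, int]:
    """Map special tokens, then every symbol in the corpus, to integer ids."""
    symbols = sorted({tok for s in corpus for tok in tokenize_selfies(s)})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + symbols)}


def encode(selfies: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a SELFIES string to ids, wrapped in start and end tokens."""
    body = [vocab[tok] for tok in tokenize_selfies(selfies)]
    return [vocab["<s>"]] + body + [vocab["</s>"]]


# Illustrative corpus: ethanol and benzene (assumed example strings).
corpus = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]
vocab = build_vocab(corpus)
ids = encode("[C][C][O]", vocab)
print(tokenize_selfies(corpus[1]))
print(ids)
```

Because every SELFIES symbol is a self-delimiting bracketed unit, the vocabulary in this sketch is simply the set of symbols seen in the corpus. A real pipeline would more likely rely on an established SELFIES library and a tokenizer matched to the pretrained model.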