MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction

Jun-Hyung Park, Yeachan Kim, Mingyu Lee, Hyuntae Park, SangKeun Lee
2024-07-09
Abstract: Chemical representation learning has gained increasing interest due to the limited availability of supervised data in fields such as drug and materials design. This interest particularly extends to chemical language representation learning, which involves pre-training Transformers on SMILES sequences -- textual descriptors of molecules. Despite its success in molecular property prediction, current practices often lead to overfitting and limited scalability due to early convergence. In this paper, we introduce a novel chemical language representation learning framework, called MolTRES, to address these issues. MolTRES incorporates generator-discriminator training, allowing the model to learn from more challenging examples that require structural understanding. In addition, we enrich molecular representations by transferring knowledge from scientific literature by integrating external materials embedding. Experimental results show that our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
Chemical Physics, Materials Science, Machine Learning
What problem does this paper attempt to address?
The paper aims to address issues in chemical language representation learning, particularly overfitting and limited scalability in molecular property prediction tasks. Specifically, the paper focuses on the following points:

1. **Limitations of current methods**: Existing chemical language representation learning methods (such as Transformer models pre-trained on SMILES sequences) tend to overfit and converge early in the pre-training stage, which limits their ability to benefit from large-scale data.
2. **Surface pattern problem**: SMILES sequences contain many surface patterns that allow models to predict the original labels without understanding the underlying chemical information, leading to poor model performance.
3. **Imbalanced data distribution**: The distribution of atoms in large molecular datasets is imbalanced, with elements like carbon, nitrogen, and oxygen accounting for the vast majority of labels, which further increases the difficulty of model learning.

To address these issues, the paper proposes a new framework called MolTRES, which combines generator-discriminator training with knowledge transfer from scientific literature. The main contributions of MolTRES include:

- **Dynamic Molecular Modeling (DynaMol)**: Increasing the difficulty of the pre-training task through a generator-discriminator training strategy, which helps the model learn complex molecular structures.
- **External Knowledge Integration**: Integrating mat2vec word embeddings obtained from scientific literature to directly incorporate molecular property information into the model representations.

Experimental results show that MolTRES outperforms existing state-of-the-art models on multiple molecular property prediction tasks, covering both classification and regression. Additionally, ablation studies validate the contributions of DynaMol and mat2vec embeddings to the performance improvement.
In summary, the goal of this paper is to improve chemical language representation learning by proposing the MolTRES framework to overcome the limitations of existing methods, thereby enhancing the accuracy of molecular property predictions.
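The generator-discriminator idea described above can be illustrated with a small sketch of how discriminator training examples might be built from a SMILES string. This is not the paper's implementation: the regex tokenizer, the 50% masking ratio, and the uniform-random `toy_generator` are all assumptions made for illustration (a real generator would be a small masked language model, and the summary does not give the masking ratio). The labeling rule follows the usual replaced-token-detection convention, where a sampled token that happens to match the original is labeled as original.

```python
import random
import re

# Simplified SMILES tokenizer (assumption): bracket atoms, Br/Cl, then single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into tokens with a simplified regex."""
    return TOKEN_RE.findall(smiles)


def corrupt(tokens, mask_positions, sample_fn):
    """Replace tokens at mask_positions with generator samples.

    Returns the corrupted sequence and per-token labels for the
    discriminator: 1 if the token differs from the original, else 0.
    """
    corrupted = list(tokens)
    labels = [0] * len(tokens)
    for i in mask_positions:
        proposal = sample_fn(tokens, i)
        corrupted[i] = proposal
        # A sample identical to the original counts as "original".
        labels[i] = int(proposal != tokens[i])
    return corrupted, labels


def toy_generator(vocab, rng):
    """Hypothetical stand-in for a trained generator: samples uniformly from vocab."""
    def sample(tokens, i):
        return rng.choice(vocab)
    return sample


rng = random.Random(0)
tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
vocab = sorted(set(tokens))

# Assumed masking ratio of 50%, chosen only for this illustration.
mask_positions = rng.sample(range(len(tokens)), k=max(1, len(tokens) // 2))
corrupted, labels = corrupt(tokens, mask_positions, toy_generator(vocab, rng))

print("original :", "".join(tokens))
print("corrupted:", "".join(corrupted))
print("labels   :", labels)
```

Because the discriminator must judge every position rather than only the masked ones, and because a trained generator produces plausible replacements, the task is harder to solve from surface patterns alone, which is the motivation the summary gives for this training strategy.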