Abstract: Chemical representation learning has gained increasing interest due to the limited availability of supervised data in fields such as drug and materials design. This interest particularly extends to chemical language representation learning, which involves pre-training Transformers on SMILES sequences -- textual descriptors of molecules. Despite its success in molecular property prediction, current practices often lead to overfitting and limited scalability due to early convergence. In this paper, we introduce a novel chemical language representation learning framework, called MolTRES, to address these issues. MolTRES incorporates generator-discriminator training, allowing the model to learn from more challenging examples that require structural understanding. In addition, we enrich molecular representations with knowledge transferred from scientific literature by integrating external materials embeddings. Experimental results show that our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.

What problem does this paper attempt to address?

The paper aims to address issues in chemical language representation learning, particularly overfitting and limited scalability in molecular property prediction tasks. Specifically, the paper focuses on the following points:

1. **Limitations of current methods**: Existing chemical language representation learning methods (such as Transformer models based on SMILES sequences) tend to overfit and converge early in the pre-training stage, limiting their ability to handle large-scale data.
2. **Surface pattern problem**: There are many surface patterns in SMILES sequences that allow models to predict original labels without understanding the underlying chemical information, leading to poor model performance.
3. **Imbalanced data distribution**: The distribution of atoms in large molecular datasets is imbalanced, with elements like carbon, nitrogen, and oxygen occupying the vast majority of labels, further increasing the difficulty of model learning.

To address these issues, the paper proposes a new framework called MolTRES, which combines generator-discriminator training and knowledge transfer from scientific literature. The main contributions of MolTRES include:

- **Dynamic Molecular Modeling (DynaMol)**: Increasing the difficulty of pre-training tasks through a generator-discriminator training strategy, which helps improve the model's ability to learn complex molecular structures.
- **External Knowledge Integration**: Integrating mat2vec word embeddings obtained from scientific literature to directly incorporate molecular property information into the model representations.

Experimental results show that MolTRES outperforms existing state-of-the-art models on multiple molecular property prediction tasks, particularly excelling in classification and regression tasks. Additionally, ablation studies validate the contributions of DynaMol and mat2vec embeddings to performance improvement.
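To make the generator-discriminator idea more concrete, here is a minimal sketch of how training examples for a discriminator could be built from a SMILES string. It assumes an ELECTRA-style replaced-token-detection setup, which the summary above does not spell out. The regex tokenizer, the `toy_generator` stand-in (which samples uniformly instead of using a trained masked language model), the vocabulary, and the `mask_prob` value are all hypothetical choices for illustration, not details from the paper.

```python
import random
import re

# Simplified SMILES tokenizer (an illustrative assumption, not the paper's):
# bracket atoms, two-letter halogens, two-digit ring closures, then single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)


def toy_generator(tokens: list[str], position: int, vocab: list[str], rng: random.Random) -> str:
    """Hypothetical stand-in for a trained generator.

    A real generator would predict the masked token from its context, which
    yields plausible (and therefore harder to detect) replacements. This toy
    version ignores the context and samples uniformly from the vocabulary.
    """
    return rng.choice(vocab)


def make_discriminator_example(smiles: str, vocab: list[str], mask_prob: float = 0.5, seed: int = 0):
    """Corrupt some tokens and label each position as original (0) or replaced (1)."""
    rng = random.Random(seed)
    original = tokenize(smiles)
    corrupted, labels = [], []
    for i, tok in enumerate(original):
        if rng.random() < mask_prob:
            new_tok = toy_generator(original, i, vocab, rng)
            corrupted.append(new_tok)
            # If the generator happens to reproduce the original token,
            # the position still counts as "original".
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return original, corrupted, labels


# Small hypothetical vocabulary; a real one would be built from the training corpus.
vocab = ["C", "N", "O", "c", "n", "o", "(", ")", "=", "1"]
original, corrupted, labels = make_discriminator_example("CC(=O)Oc1ccccc1", vocab)
print("original :", original)
print("corrupted:", corrupted)
print("labels   :", labels)
```

The discriminator would then be trained to predict `labels` from `corrupted`. Because every position carries a label, not just the masked ones, this kind of objective gives a training signal on all tokens, and a stronger generator makes the replacements harder to spot.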
In summary, the goal of this paper is to improve chemical language representation learning by proposing the MolTRES framework to overcome the limitations of existing methods, thereby enhancing the accuracy of molecular property predictions.
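The atom imbalance mentioned among the limitations can be checked on any SMILES collection by counting atom tokens. The sketch below does this for a handful of hand-picked molecules; the tokenizer regex, the rule for treating aromatic lowercase atoms as their element, and the example molecules are assumptions for illustration and say nothing about the datasets used in the paper.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer (an illustrative assumption):
# bracket atoms, two-letter halogens, two-digit ring closures, then single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)


def atom_frequencies(smiles_list: list[str]) -> dict[str, float]:
    """Relative frequency of each atom token across a list of SMILES strings."""
    counts = Counter()
    for smi in smiles_list:
        for tok in tokenize(smi):
            # Keep atom tokens only; skip bonds, branches, and ring-closure digits.
            if not (tok[0].isalpha() or tok.startswith("[")):
                continue
            # Count aromatic atoms (lowercase, e.g. "c") under their element symbol.
            if tok.islower():
                tok = tok.upper()
            counts[tok] += 1
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.most_common()}


# Ethanol, benzene, acetic acid, ethylamine, dichloromethane.
molecules = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "ClCCl"]
freqs = atom_frequencies(molecules)
for atom, share in freqs.items():
    print(f"{atom}: {share:.2f}")
```

Even in this tiny sample, carbon accounts for most atom tokens, which suggests why a masked-token objective can score well by favoring the most common atoms.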

MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction
