Abstract: Training data are usually limited or heterogeneous in many chemical and biological applications. Existing machine learning models for chemistry and materials science fail to consider generalizing beyond training domains. In this article, we develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems. MROT learns a continuous label of the data by measuring a new metric of domain distances and a posterior variance regularization over the transport plan to bridge the chemical domain gap. Among downstream tasks, we consider basic chemical regression tasks in unsupervised and semi-supervised settings, including chemical property prediction and materials adsorption selection. Extensive experiments show that MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances with desired properties.

What problem does this paper attempt to address?

The paper addresses how to overcome limited and heterogeneous training data in molecular representation learning so that models generalize across domains. Specifically, the authors propose MROT (Metric learning-enhanced Optimal Transport), a method based on Optimal Transport (OT) that aims to improve generalization in molecular regression tasks by minimizing the distance between the source domain and the target domain. The method targets molecular property prediction and materials adsorption selection in unsupervised and semi-supervised settings, where predictions are needed beyond the range covered by the training samples.

### Core Problems of the Paper

1. **Data Limitations and Heterogeneity**: In many chemical and biological applications, training data are limited or heterogeneous, which restricts the generalization ability of existing machine learning models in chemistry and materials science.
2. **Domain Adaptation (DA)**: How to adapt effectively between different data distributions, and in particular how to use existing training data to predict molecular properties in new domains in molecular regression tasks.

### Solutions

- **MROT Method**: MROT bridges the chemical domain gap by introducing a new metric of domain distance and a posterior variance regularization. Its components are:
  - **Metric Learning**: Designs metrics that measure the distance between domains.
  - **Posterior Variance Regularization**: Adds a posterior variance regularization over the transport plan to make full use of the regression labels of the source domain.
  - **Dynamic Hierarchical Triplet Loss**: Adds a dynamic hierarchical triplet loss to obtain a more discriminative feature space and avoid ambiguous decision boundaries.
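
The transport-plan idea in the Solutions above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain squared Euclidean cost as a stand-in for MROT's learned metric, uses standard entropic-regularized OT solved with Sinkhorn iterations, and takes one plausible reading of "posterior variance", namely the variance of the source labels under each target sample's normalized column of the plan. The function names (`pairwise_sq_dists`, `sinkhorn_plan`, `posterior_label_stats`) and the synthetic data are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(x_source, x_target):
    """Squared Euclidean distances; an assumed stand-in for a learned domain metric."""
    diff = x_source[:, None, :] - x_target[None, :, :]
    return (diff ** 2).sum(axis=-1)


def sinkhorn_plan(cost, reg=0.1, n_iters=500):
    """Entropic-regularized OT plan between uniform marginals via Sinkhorn iterations."""
    n_s, n_t = cost.shape
    a = np.full(n_s, 1.0 / n_s)  # uniform weights on source samples
    b = np.full(n_t, 1.0 / n_t)  # uniform weights on target samples
    kernel = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n_s)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match the target marginal
        u = a / (kernel @ v)    # rescale to match the source marginal
    return u[:, None] * kernel * v[None, :]


def posterior_label_stats(plan, y_source):
    """Mean and variance of source labels under each target's posterior over sources."""
    posterior = plan / plan.sum(axis=0, keepdims=True)  # each column sums to 1
    mean = posterior.T @ y_source
    var = posterior.T @ (y_source ** 2) - mean ** 2
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off


# Synthetic source and target "domains" (illustrative data only).
rng = np.random.default_rng(0)
x_s = rng.normal(size=(30, 4))
y_s = x_s @ np.array([1.0, -2.0, 0.5, 0.0])  # continuous regression labels
x_t = rng.normal(loc=0.5, size=(20, 4))       # shifted, unlabeled target features

cost = pairwise_sq_dists(x_s, x_t)
cost = cost / cost.max()                      # normalize so reg has a stable scale
plan = sinkhorn_plan(cost)
y_t_hat, y_t_var = posterior_label_stats(plan, y_s)
variance_penalty = y_t_var.mean()             # candidate regularization term
```

In this sketch `y_t_hat` is a transported label estimate for each target sample, and `variance_penalty` is small when each target sample receives mass mostly from source samples with similar labels. In a training loop, a term of this kind would be added to the transport objective; how MROT weights and combines it with the metric and triplet losses is not specified in this summary.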
### Experimental Verification

- **Data Sets**: The paper reports experiments on multiple benchmark data sets covering quantum chemistry (QM7, QM8, QM9), physical chemistry (ESOL, FreeSolv, Lipophilicity), and materials science (CoRE-MOF, Exp-MOF).
- **Performance Evaluation**: The results show that MROT significantly outperforms existing baseline methods in unsupervised and semi-supervised tasks, with the advantage more pronounced on large-scale data sets.

### Main Contributions

- **Generalization Ability Improvement**: MROT improves generalization across domains through its optimal transport strategy, with particularly strong results in molecular regression tasks.
- **Domain Adaptation**: The paper proposes a new domain adaptation method that handles differences in data distribution in chemistry and materials science.
- **Practical Applications**: The method has potential applications in fields such as drug discovery and materials synthesis, and could accelerate the discovery of new substances.

### Conclusion

The paper addresses the problem of limited and heterogeneous training data in molecular representation learning by proposing MROT, which improves the generalization ability of models across domains. The result offers new ideas and techniques for domain adaptation in chemistry and materials science.

Improving Molecular Representation Learning with Metric Learning-enhanced Optimal Transport
