Improving Molecular Representation Learning with Metric Learning-enhanced Optimal Transport

Fang Wu, Nicolas Courty, Shuting Jin, Stan Z. Li
2023-10-30
Abstract: Training data are usually limited or heterogeneous in many chemical and biological applications. Existing machine learning models for chemistry and materials science fail to consider generalizing beyond training domains. In this article, we develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems. MROT learns a continuous label of the data by measuring a new metric of domain distances and a posterior variance regularization over the transport plan to bridge the chemical domain gap. Among downstream tasks, we consider basic chemical regression tasks in unsupervised and semi-supervised settings, including chemical property prediction and materials adsorption selection. Extensive experiments show that MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances with desired properties.
Machine Learning, Artificial Intelligence, Quantitative Methods
What problem does this paper attempt to address?
This paper asks how to overcome limited and heterogeneous training data in molecular representation learning so that models generalize across domains. The authors propose MROT (Metric learning-enhanced Optimal Transport), an optimal transport (OT) based method that improves generalization in molecular regression tasks by minimizing the distance between the source domain and the target domain. The method targets molecular property prediction and materials adsorption selection in unsupervised and semi-supervised settings, where predictions must extend beyond the range of the training samples.

### Core Problems of the Paper

1. **Data Limitations and Heterogeneity**: In many chemical and biological applications, training data are limited or heterogeneous, which restricts how well existing machine learning models in chemistry and materials science generalize.
2. **Domain Adaptation (DA)**: How to adapt effectively between different data distributions and, in molecular regression tasks in particular, how to use existing training data to predict molecular properties in new domains.

### Solutions

- **MROT Method**: MROT bridges the chemical domain gap by introducing a new metric of domain distances and a posterior variance regularization. Its components are:
  - **Metric Learning**: Design a new metric that measures the distance between domains.
  - **Posterior Variance Regularization**: Add a posterior variance regularization over the transport plan to make full use of the regression labels in the source domain.
  - **Dynamic Hierarchical Triplet Loss**: Add a dynamic hierarchical triplet loss to obtain a more discriminative feature space and avoid ambiguous decision boundaries.
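The posterior variance idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a plain squared Euclidean cost between feature vectors (standing in for MROT's learned metric), uniform sample weights, and entropic OT solved with Sinkhorn iterations. The function names `sinkhorn` and `posterior_label_stats`, the `reg` parameter, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def sinkhorn(a, b, cost, reg=0.2, n_iters=500):
    """Entropic OT plan between weights a (n,) and b (m,) for an (n, m) cost.

    Assumption: standard Sinkhorn scaling, not the solver used in the paper.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match target marginal
        u = a / (K @ v)              # rescale to match source marginal
    return u[:, None] * K * v[None, :]


def posterior_label_stats(plan, y_source):
    """Mean and variance of source labels under each target's posterior.

    Column j of the plan, normalized to sum to 1, is read as a posterior
    over source samples for target sample j (an illustrative assumption).
    """
    post = plan / plan.sum(axis=0, keepdims=True)      # (n, m)
    mean = post.T @ y_source                           # (m,)
    diff = y_source[:, None] - mean[None, :]           # (n, m)
    var = (post * diff ** 2).sum(axis=0)               # (m,), always >= 0
    return mean, var


# Synthetic source and shifted target domain (hypothetical data).
rng = np.random.default_rng(0)
x_src = rng.normal(size=(30, 4))
y_src = x_src @ np.array([1.0, -0.5, 0.3, 0.0])
x_tgt = rng.normal(loc=0.5, size=(20, 4))

# Squared Euclidean cost, scaled to [0, 1] for numerical stability.
cost = ((x_src[:, None, :] - x_tgt[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

a = np.full(len(x_src), 1.0 / len(x_src))
b = np.full(len(x_tgt), 1.0 / len(x_tgt))
plan = sinkhorn(a, b, cost)

pseudo_labels, label_var = posterior_label_stats(plan, y_src)
variance_penalty = label_var.mean()   # candidate regularization term
print(pseudo_labels.shape, float(variance_penalty))
```

A low per-target variance means the plan maps each target sample to source samples with similar labels. Under these assumptions, a regularizer that penalizes `variance_penalty` would push the transport plan in that direction, which is one way to read "making full use of the regression labels of the source domain".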
### Experimental Verification

- **Data Sets**: The paper reports experiments on multiple benchmark data sets covering quantum chemistry (QM7, QM8, QM9), physical chemistry (ESOL, FreeSolv, Lipophilicity), and materials science (CoRE-MOF, Exp-MOF).
- **Performance Evaluation**: The results show that MROT significantly outperforms existing baseline methods in unsupervised and semi-supervised tasks, with the advantage most pronounced on large-scale data sets.

### Main Contributions

- **Generalization Ability Improvement**: MROT uses an optimal transport strategy to improve generalization across domains, particularly in molecular regression tasks.
- **Domain Adaptation**: The paper proposes a new domain adaptation method that handles differences in data distribution in chemistry and materials science.
- **Practical Applications**: The method has potential value in fields such as drug discovery and materials synthesis, where it could accelerate the discovery of new substances.

### Conclusion

The paper addresses the limited and heterogeneous training data in molecular representation learning by proposing MROT, which improves generalization across domains. The result offers a new approach to domain adaptation in chemistry and materials science.