MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis

Shikun Feng, Jiaxin Zheng, Yinjun Jia, Yanwen Huang, Fengfeng Zhou, Wei-Ying Ma, Yanyan Lan
2024-06-13
Abstract: Molecular representation learning is pivotal for various molecular property prediction tasks related to drug discovery. Robust and accurate benchmarks are essential for refining and validating current methods. Existing molecular property benchmarks derived from wet experiments, however, face limitations such as data volume constraints, unbalanced label distribution, and noisy labels. To address these issues, we construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline. We conduct extensive experiments on various deep learning models, demonstrating that our dataset offers significant physicochemical interpretability to guide model development and design. Notably, the dataset's properties are linked to binding affinity metrics, providing additional insights into model performance in drug-target interaction tasks. We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning, thereby expediting progress in the field of artificial intelligence-driven drug discovery.
Chemical Physics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses several shortcomings of the datasets used to evaluate Molecular Representation Learning (MRL) methods, with the goal of advancing AI-driven drug discovery. Specifically, it targets limitations of existing molecular property benchmarks (such as MoleculeNet): data volume constraints, imbalanced label distribution, label noise, and data inconsistency. The main contribution is a new large-scale and precise molecular representation dataset, MoleculeCLA. It is generated through computational ligand-target binding analysis and includes approximately 140,000 small molecules with extensive chemical, physical, and biological properties. MoleculeCLA is designed to overcome the issues of existing datasets and to provide a more reliable and accurate benchmark for molecular representation learning methods. Features of the MoleculeCLA dataset include:
1. **Scale and Diversity**: It contains approximately 140,000 small molecules, carefully selected to ensure diversity and representativeness of the chemical space.
2. **Property Coverage**: It covers nine molecular properties, divided into chemical properties (such as hydrophobicity and hydrogen bond formation tendency), physical properties (such as van der Waals energy and Coulomb energy), and biological properties (such as docking scores and model energy), all closely related to the ligand-target binding process.
3. **Computational Source**: The properties are obtained computationally rather than experimentally, which avoids the noise and uncertainty inherent in experimental data.
4. **Task Organization**: The data are organized into multiple tasks based on different protein targets, and these tasks are used to evaluate how well different models predict specific molecular properties.
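The task organization described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each label is stored as one record per molecule, protein target, and property, and that a task is one (target, property) pair. The `Record` fields, the target names, and the numeric values are hypothetical placeholders, not the dataset's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One computed label. Field names are assumptions, not the dataset schema."""
    smiles: str    # molecule as a SMILES string
    target: str    # protein target identifier (hypothetical names below)
    prop: str      # name of the computed property
    value: float   # computed label (placeholder values below)


def group_into_tasks(records):
    """Group records into regression tasks keyed by (target, property)."""
    tasks = defaultdict(list)
    for r in records:
        tasks[(r.target, r.prop)].append((r.smiles, r.value))
    return dict(tasks)


# Placeholder records; the numbers are made up for the example.
records = [
    Record("CCO", "TARGET_A", "docking_score", -5.1),
    Record("c1ccccc1", "TARGET_A", "docking_score", -6.3),
    Record("CCO", "TARGET_A", "vdw_energy", -12.0),
    Record("CCO", "TARGET_B", "docking_score", -4.2),
]

tasks = group_into_tasks(records)
for (target, prop), examples in sorted(tasks.items()):
    print(target, prop, len(examples))
```

Keying tasks by the pair rather than by target alone keeps each task a single-output regression problem, which is one simple way to evaluate a model on one property at a time.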
The paper also conducts extensive experiments comparing deep learning models based on Graph Neural Network (GNN) and Transformer architectures on the MoleculeCLA dataset. The results show that the dataset provides insight into model performance and can help guide model development and design. Additionally, the study validates that MoleculeCLA is effective for selecting models suited to drug-target interaction tasks.
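A comparison of this kind can be sketched as follows. The summary does not state which metrics the paper uses, so the choice of the Pearson correlation averaged over tasks is an assumption, and the model names, task names, and numbers are hypothetical placeholders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_models(labels_by_task, preds_by_model):
    """Rank models by their mean per-task Pearson correlation, best first."""
    scores = {}
    for model, preds_by_task in preds_by_model.items():
        per_task = [
            pearson(labels_by_task[task], preds_by_task[task])
            for task in labels_by_task
        ]
        scores[model] = sum(per_task) / len(per_task)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder labels and predictions for two hypothetical tasks and models.
labels_by_task = {"task_1": [1.0, 2.0, 3.0], "task_2": [3.0, 2.0, 1.0]}
preds_by_model = {
    "model_a": {"task_1": [1.0, 2.0, 3.0], "task_2": [3.0, 2.0, 1.0]},
    "model_b": {"task_1": [3.0, 2.0, 1.0], "task_2": [3.0, 2.0, 1.0]},
}

ranking = rank_models(labels_by_task, preds_by_model)
print(ranking)
```

Averaging over tasks gives every task equal weight regardless of its size; a weighted mean or a different metric such as RMSE would be an equally reasonable choice under other assumptions.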