Abstract: Molecular representation learning is pivotal for various molecular property prediction tasks related to drug discovery. Robust and accurate benchmarks are essential for refining and validating current methods. Existing molecular property benchmarks derived from wet experiments, however, face limitations such as data volume constraints, unbalanced label distribution, and noisy labels. To address these issues, we construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline. We conduct extensive experiments on various deep learning models, demonstrating that our dataset offers significant physicochemical interpretability to guide model development and design. Notably, the dataset's properties are linked to binding affinity metrics, providing additional insights into model performance in drug-target interaction tasks. We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning, thereby expediting progress in the field of artificial intelligence-driven drug discovery.

What problem does this paper attempt to address?

The paper addresses shortcomings in the datasets used to evaluate Molecular Representation Learning (MRL) methods, with the goal of advancing AI-driven drug discovery. Specifically, it targets the limitations of existing molecular property benchmarks (such as MoleculeNet): data volume constraints, imbalanced label distribution, label noise, and data inconsistency. The main contribution is a new large-scale, precise molecular representation dataset, MoleculeCLA. It is generated through computational ligand-target binding analysis and includes approximately 140,000 small molecules annotated with chemical, physical, and biological properties. MoleculeCLA is designed to overcome the issues of existing datasets and to provide a more reliable and accurate benchmark for molecular representation learning methods.

Features of the MoleculeCLA dataset include:

1. **Scale and Diversity**: It contains approximately 140,000 small molecules, carefully selected to ensure diversity and representativeness of the chemical space.
2. **Property Coverage**: It covers nine molecular properties, divided into chemical properties (such as hydrophobicity and hydrogen bond formation tendency), physical properties (such as van der Waals energy and Coulomb energy), and biological properties (such as docking scores and model energy), all closely related to the ligand-target binding process.
3. **Computational Source**: The properties are obtained computationally rather than experimentally, which avoids the inherent noise and uncertainty of experimental data.
4. **Task Organization**: The data is organized into multiple tasks based on different protein targets, and these tasks are used to evaluate how well different models predict specific molecular properties.

The paper also conducts extensive experiments comparing the performance of various deep learning models based on Graph Neural Networks (GNN) and Transformer architectures on the MoleculeCLA dataset. The experimental results show that the MoleculeCLA dataset can provide important insights into model performance and help guide model development and design. Additionally, the study validates the effectiveness of the MoleculeCLA dataset in selecting models suitable for drug-target interaction tasks.
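
To make the task organization concrete, the following is a minimal sketch of how per-task evaluation might look when each task is a (protein target, property) pair. Everything in it is an assumption for illustration: the record layout, the target and property names, the numeric values, and the choice of RMSE and Pearson correlation as metrics are hypothetical and are not taken from the paper or from the MoleculeCLA release.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical records: (target, property_name, true_value, predicted_value).
# Names and numbers are illustrative placeholders, not MoleculeCLA data.
records = [
    ("target_A", "docking_score", -7.0, -6.5),
    ("target_A", "docking_score", -5.0, -5.5),
    ("target_A", "docking_score", -9.0, -8.0),
    ("target_B", "hydrophobic_term", 1.0, 1.0),
    ("target_B", "hydrophobic_term", 2.0, 2.5),
    ("target_B", "hydrophobic_term", 3.0, 2.5),
]


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(y_true, y_pred):
    """Pearson correlation; returns NaN if either sequence is constant."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return float("nan")
    return cov / sqrt(var_t * var_p)


def evaluate_by_task(rows):
    """Group rows by (target, property) and score each group separately."""
    grouped = defaultdict(lambda: ([], []))
    for target, prop, y, y_hat in rows:
        truths, preds = grouped[(target, prop)]
        truths.append(y)
        preds.append(y_hat)
    return {
        task: {"rmse": rmse(truths, preds), "pearson": pearson(truths, preds)}
        for task, (truths, preds) in grouped.items()
    }


results = evaluate_by_task(records)
for task, scores in sorted(results.items()):
    print(task, scores)
```

Scoring each (target, property) pair separately, rather than pooling all predictions, keeps properties with different scales from dominating one another and makes it visible which kinds of properties a given model predicts well.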

MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis
