Molecular set representation learning

Daniel Probst, Maria Boulougouri, Pierre Vandergheynst

DOI: https://doi.org/10.26434/chemrxiv-2023-fk7kf-v2

2024-03-15

Abstract: Computational representation of molecules can take many forms, including graphs, string-encodings of graphs, binary vectors, or learned embeddings in the form of real-valued vectors. These representations are then used in downstream classification and regression tasks using a wide range of machine-learning models. However, existing models come with limitations, such as the requirement for clearly defined chemical bonds, which often do not represent the true underlying nature of a molecule. Here, we propose a framework for molecular machine learning tasks based on set representation learning. We show that learning on sets of atomic invariants alone reaches the performance of state-of-the-art graph-based models on the most-used chemical benchmark data sets and that introducing a set representation layer into graph neural networks can surpass the performance of established methods in the domains of chemistry, biology, and material science. We introduce specialised set representation-based neural network architectures for reaction yield and protein-ligand binding affinity prediction. Overall, we show that the technique we denote molecular set representation learning is both an alternative and an extension to graph neural network architectures for machine learning tasks on molecules, molecule complexes, and chemical reactions.

Chemistry

What problem does this paper attempt to address?

The paper proposes a molecular machine learning framework based on set representation learning, positioned as both an alternative to and an extension of graph neural networks. Existing molecular representations such as graph encodings require clearly defined chemical bonds and therefore may not reflect the true nature of a molecule, especially where non-covalent bonds and dynamic interactions are involved. The researchers instead represent each molecule as a set of atomic invariants, which does not rely on explicit chemical bond definitions but preserves implicit information about molecular structure. They design several set-based neural network architectures, including specialised ones for reaction yield and protein-ligand binding affinity prediction, and compare them with graph neural network models on multiple chemical tasks. The results show that learning on sets of atomic invariants alone can match state-of-the-art graph-based models on the most-used benchmark data sets, and that adding a set representation layer to a graph neural network can surpass established methods in chemistry, biology, and material science. The paper also emphasizes that existing benchmark datasets may be insufficient to fully evaluate the advantages of graph neural networks, and it provides a collection of easy-to-use and extensible molecular set representation architectures.
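To make the idea of learning on a set of atomic invariants concrete, here is a minimal sketch in pure Python. It is not the authors' architecture: it assumes a generic Deep Sets-style encoder (embed each atom, sum, then transform), uses untrained random weights, and uses a hand-written, hypothetical invariant tuple per atom (atomic number, heavy-atom degree, formal charge, attached hydrogens). The names `make_weights`, `dense`, and `encode_set` are illustrative only.

```python
import math
import random

# Hypothetical per-atom invariants, written by hand for illustration:
# (atomic number, heavy-atom degree, formal charge, attached hydrogens).
# These are the three heavy atoms of ethanol (CH3-CH2-OH).
ETHANOL = [(6, 1, 0, 3), (6, 2, 0, 2), (8, 1, 0, 1)]


def make_weights(n_in, n_out, rng):
    """Random dense-layer weights (untrained; illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, x):
    """Apply a bias-free linear layer followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def encode_set(atoms, phi, rho):
    """Deep Sets-style encoder: rho(sum over atoms of phi(atom))."""
    embedded = [dense(phi, atom) for atom in atoms]
    # Sum pooling makes the result independent of atom order,
    # and no bond list is consulted anywhere.
    pooled = [sum(column) for column in zip(*embedded)]
    return dense(rho, pooled)


rng = random.Random(0)
phi = make_weights(4, 8, rng)  # per-atom embedding: 4 invariants -> 8 features
rho = make_weights(8, 2, rng)  # set-level readout: 8 features -> 2 outputs

original = encode_set(ETHANOL, phi, rho)
shuffled = encode_set(list(reversed(ETHANOL)), phi, rho)
```

Under these assumptions, `original` and `shuffled` agree up to floating-point rounding, which illustrates why a set of atoms needs neither an ordering nor an explicit bond list. In practice the two weight matrices would be trained, and the invariants would be computed by a cheminformatics toolkit rather than typed by hand.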

An Image-enhanced Molecular Graph Representation Learning Framework

Chemical-Reaction-Aware Molecule Representation Learning

Graph-based Molecular Representation Learning

Molecular contrastive learning of representations via graph neural networks

Molecular CT: Unifying Geometry and Representation Learning for Molecules at Different Scales

Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs

Deep Molecular Representation Learning via Fusing Physical and Chemical Information

Self-Supervised Contrastive Molecular Representation Learning with a Chemical Synthesis Knowledge Graph

Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks

Analyzing Learned Molecular Representations for Property Prediction

Molecular representations for machine learning applications in chemistry

Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

A review of molecular representation in the age of machine learning

MolSets: Molecular Graph Deep Sets Learning for Mixture Property Modeling

Multi-Modal Representation Learning for Molecular Property Prediction: Sequence, Graph, Geometry

MoleculeNet: A Benchmark for Molecular Machine Learning

Substructure-Atom Cross Attention for Molecular Representation Learning