Molecular set representation learning

Daniel Probst, Maria Boulougouri, Pierre Vandergheynst
DOI: https://doi.org/10.26434/chemrxiv-2023-fk7kf-v2
2024-03-15
Abstract: Computational representation of molecules can take many forms, including graphs, string-encodings of graphs, binary vectors, or learned embeddings in the form of real-valued vectors. These representations are then used in downstream classification and regression tasks using a wide range of machine-learning models. However, existing models come with limitations, such as the requirement for clearly defined chemical bonds, which often do not represent the true underlying nature of a molecule. Here, we propose a framework for molecular machine learning tasks based on set representation learning. We show that learning on sets of atomic invariants alone reaches the performance of state-of-the-art graph-based models on the most-used chemical benchmark data sets and that introducing a set representation layer into graph neural networks can surpass the performance of established methods in the domains of chemistry, biology, and material science. We introduce specialised set representation-based neural network architectures for reaction yield and protein-ligand binding affinity prediction. Overall, we show that the technique we denote molecular set representation learning is both an alternative and an extension to graph neural network architectures for machine learning tasks on molecules, molecule complexes, and chemical reactions.
Chemistry
What problem does this paper attempt to address?
The paper proposes a molecular machine learning framework based on set representation learning, positioned as both an alternative and an extension to graph neural networks. Existing molecular representations such as graphs require clearly defined chemical bonds, which often do not reflect the true nature of a molecule, especially where non-covalent bonds and dynamic interactions are involved. The researchers instead represent each molecule as a set of atomic invariants, which does not rely on explicit bond definitions but preserves implicit information about molecular structure. They design several set-based neural network architectures, including specialised ones for reaction yield and protein-ligand binding affinity prediction, and compare them with graph neural network models on multiple chemical tasks. The results show that learning on sets of atomic invariants alone can match state-of-the-art graph-based models on widely used benchmarks, and that adding a set representation layer to graph neural networks can surpass established methods in chemistry, biology, and material science. The paper also emphasizes that existing benchmark datasets may be insufficient to fully evaluate the advantages of graph neural networks and provides a collection of easy-to-use and extensible molecular set representation architectures.
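To make the idea of learning on a set of atomic invariants concrete, the following is a minimal sketch of a generic sum-pooling set encoder in the DeepSets style. It is not the authors' architecture. The function names, the layer sizes, the untrained random weights, and the choice of invariants (atomic number, heavy-atom degree, attached hydrogens, formal charge) are all assumptions made for illustration. The point it shows is that no bond list is passed in, and that the output does not depend on the order in which atoms are listed.

```python
import math
import random


def make_weights(n_in, n_out, rng):
    """Random dense-layer weights (illustrative only, untrained)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]


def dense_tanh(x, w):
    """Apply a bias-free dense layer followed by tanh to one vector."""
    n_out = len(w[0])
    return [
        math.tanh(sum(x[i] * w[i][j] for i in range(len(x))))
        for j in range(n_out)
    ]


def encode_set(atoms, w_phi, w_rho):
    """Permutation-invariant encoder: rho(sum over atoms of phi(atom))."""
    # phi: embed every atom independently of all others
    hidden = [dense_tanh(atom, w_phi) for atom in atoms]
    # Sum pooling; math.fsum is exactly rounded, so atom order cannot matter
    pooled = [math.fsum(h[j] for h in hidden) for j in range(len(hidden[0]))]
    # rho: map the pooled vector to the molecule-level embedding
    return dense_tanh(pooled, w_rho)


# Hypothetical invariants per heavy atom of ethanol:
# (atomic number, heavy-atom degree, attached hydrogens, formal charge)
ethanol = [
    [6.0, 1.0, 3.0, 0.0],  # CH3
    [6.0, 2.0, 2.0, 0.0],  # CH2
    [8.0, 1.0, 1.0, 0.0],  # OH
]

rng = random.Random(0)
w_phi = make_weights(4, 8, rng)  # assumed hidden size of 8
w_rho = make_weights(8, 3, rng)  # assumed embedding size of 3

embedding = encode_set(ethanol, w_phi, w_rho)
shuffled = encode_set(list(reversed(ethanol)), w_phi, w_rho)
```

Here `embedding` and `shuffled` agree because the only step that combines atoms is a sum. In a trained model the weights would be learned and the embedding passed to a classification or regression head.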