Shape is (almost) all!: Persistent homology features (PHFs) are an information rich input for efficient molecular machine learning

Ella Gale

2023-04-15

Abstract: 3-D shape is important to chemistry, but how important? Machine learning works best when the inputs are simple and match the problem well. Chemistry datasets tend to be very small compared to those generally used in machine learning, so we need to get the most from each datapoint. Persistent homology measures the topological shape properties of point clouds at different scales and is used in topological data analysis. Here we investigate what persistent homology captures about molecular structure and create persistent homology features (PHFs) that encode a molecule's shape whilst losing most of the symbolic detail, such as atom labels, valence, charge and bonds. We demonstrate the usefulness of PHFs on a series of chemical datasets: QM7, lipophilicity, Delaney and Tox21. PHFs work as well as the best benchmarks. PHFs are very information dense and much smaller than other encoding methods found to date, so ML algorithms that use them are much more energy efficient. That PHFs succeed despite losing a large amount of chemical detail highlights how much of chemistry can be simplified to topological shape.

Machine Learning, Disordered Systems and Neural Networks, General Topology

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper primarily explores how to describe molecular structures using topological data analysis methods (specifically persistent homology) and apply them in molecular machine learning. Specifically, the paper aims to address the following issues:

1. **Importance of Molecular Shape**:
   - 3-D shape is crucial in chemistry, especially in aspects such as drug activity, chirality, optical properties, electronic structure, and reactivity. Therefore, studying how to effectively encode the three-dimensional shape of molecules is particularly important for machine learning algorithms.
2. **Efficient Feature Representation**:
   - Chemical datasets are usually small, so it is necessary to extract as much information as possible from each data point. Persistent Homology Features (PHFs) can efficiently encode molecular shape information while ignoring symbolic details such as atomic labels and bonding. This makes PHFs a very compact and information-rich feature representation method.
3. **Validation of PHFs' Effectiveness**:
   - The paper validates the effectiveness of PHFs on a series of chemical datasets, including QM7, lipophilicity, Delaney, and Tox21 datasets. The results show that the performance of PHFs is comparable to the current best benchmark methods.
4. **Exploring Topological Methods in Chemistry**:
   - Persistent homology, as a new topological analysis tool, has not been widely applied in the field of chemistry. The paper provides a detailed introduction to the basic concepts of persistent homology and its applications in chemistry, demonstrating its potential in molecular machine learning.
5. **Simplification of Chemical Problems**:
   - The paper emphasizes that even by ignoring a large amount of chemical detail, many chemical problems can be solved solely through topological shape information. This indicates that many issues in chemistry can be simplified by using topological shapes.

Through the above work, the paper aims to demonstrate the advantages of persistent homology features in molecular machine learning and promote the application of topological data analysis methods in the field of chemistry.
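
To make the idea of measuring a point cloud's shape "at different scales" concrete, here is a minimal sketch of the simplest case: 0-dimensional persistent homology (connected components) of a Vietoris-Rips filtration, computed with a union-find structure. It is not the paper's PHF pipeline, which is not described in this summary. The function name `h0_persistence` and the toy coordinates are hypothetical and chosen only for illustration, and the scale is taken to be the edge length (some libraries use half that value).

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. As the scale grows,
    an edge appears once it reaches the distance between two points; when an
    edge joins two different components, one of them dies at that scale.
    The single component that never dies (the infinite bar) is omitted.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j   # merge the two components
            deaths.append(length)     # one component dies at this scale
    return deaths


# Hypothetical 3-D coordinates: three nearby points and one distant point.
toy_coords = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (5.0, 0.0, 0.0),
]
deaths = h0_persistence(toy_coords)
print(deaths)
```

The two short bars reflect the tight cluster of three points, and the one long bar reflects the isolated fourth point. Only coordinates enter the calculation, with no atom labels or bond information, which is the sense in which such features encode shape while discarding symbolic detail. Higher-dimensional features (loops, voids) need a full persistent homology library and are not covered by this sketch.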


On the effectiveness of persistent homology

Representation of molecular structures with persistent homology for machine learning applications in chemistry

SE(3)-Invariant Multiparameter Persistent Homology for Chiral-Sensitive Molecular Property Prediction

Multiparameter Persistent Homology for Molecular Property Prediction

Molecular shape as a (useful) bias in chemistry

Molecular Shape and Medicinal Chemistry: A Perspective

Persistent-Homology-based Machine Learning and its Applications -- A Survey

Persistent homology-based descriptor for machine-learning potential of amorphous structures

Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction

Persistent homology-based descriptor for machine-learning potential

Mayer-homology learning prediction of protein-ligand binding affinities

Persistent homology analysis of ion aggregation and hydrogen-bonding network

An Algorithm for Persistent Homology Computation Using Homomorphic Encryption

How the Shape of Chemical Data Can Enable Data-Driven Materials Discovery

A physics-inspired approach to the understanding of molecular representations and models

Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks

Pull-back Geometry of Persistent Homology Encodings

Tight basis cycle representatives for persistent homology of large data sets

Persistent homology analysis of protein structure, flexibility and folding