Abstract: Condensing the many physical variables defining a chemical system into a fixed-size array poses a significant challenge in the development of chemical Machine Learning (ML). Atom-Centered Symmetry Functions (ACSFs) offer an intuitive featurization approach, but they require a tedious and labor-intensive selection of tunable parameters. In this work, we implement an unsupervised ML strategy relying on a Gaussian Mixture Model (GMM) to automatically optimize the ACSF parameters. GMMs decompose the vast chemical and conformational spaces into well-defined radial and angular clusters, which are then used to build tailor-made ACSFs. The unsupervised exploration of the space has demonstrated general applicability across a diverse range of systems, spanning from various unimolecular landscapes to heterogeneous databases. The impact of the sampling technique and temperature on space exploration is also addressed, highlighting the particularly advantageous role of high-temperature Molecular Dynamics (MD) simulations. The reliability of the resulting features is assessed through the estimation of the atomic charges of a prototypical capped amino acid and a heterogeneous collection of CHON molecules. The automatically constructed ACSFs serve as high-quality descriptors, consistently yielding typical prediction errors below 0.010 electrons for the reported atomic charges. Altering the spatial distribution of the functions with respect to the cluster highlights the critical role of symmetry breaking in achieving significantly improved features. More specifically, using two separate functions to describe the lower and upper tails of the cluster results in the best-performing models, with errors as low as 0.006 electrons.
The effectiveness of the finely tuned features was also checked across different architectures, revealing the superior performance of Gaussian Process (GP) models over Feed Forward Neural Networks (FFNNs), particularly in low-data regimes, with nearly a 2-fold increase in prediction quality. Altogether, this approach paves the way toward an easier construction of local chemical descriptors, while providing valuable insights into how radial and angular spaces should be mapped. Finally, this work opens the possibility of encoding many-body information beyond angular terms into upcoming ML features.
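To make the idea of turning GMM clusters into symmetry functions concrete, the following is a minimal sketch and not the authors' implementation. It fits a one-dimensional Gaussian mixture to synthetic interatomic distances with a plain EM loop, then places one radial symmetry function per cluster. Several things here are assumptions made purely for illustration: the helper names (`fit_gmm_1d`, `acsf_params_from_gmm`, `radial_acsf`), the synthetic distance data, the cutoff radius, and the mapping from a cluster to function parameters (shift set to the cluster mean, width set to 1/(2σ²)). The radial function itself follows the commonly used Gaussian-times-cosine-cutoff form.

```python
import math
import random


def fit_gmm_1d(samples, n_components, n_iter=200):
    """Fit a 1D Gaussian mixture with plain EM; returns (weights, means, variances)."""
    data = sorted(samples)
    n = len(data)
    # Deterministic initialization: means at evenly spaced quantiles.
    means = [data[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    grand_mean = sum(data) / n
    start_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [start_var] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances.
        for k in range(n_components):
            nk = max(sum(r[k] for r in resp), 1e-12)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid collapse
            weights[k] = nk / n
    return weights, means, variances


def acsf_params_from_gmm(means, variances):
    """Assumed mapping: one (eta, r_shift) pair per cluster, sorted by distance."""
    return [(1.0 / (2.0 * v), m) for m, v in sorted(zip(means, variances))]


def cosine_cutoff(r, r_cut):
    """Smooth cutoff that decays to zero at r_cut."""
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0) if r < r_cut else 0.0


def radial_acsf(distances, eta, r_shift, r_cut):
    """Radial symmetry function: sum of shifted Gaussians times the cutoff."""
    return sum(
        math.exp(-eta * (r - r_shift) ** 2) * cosine_cutoff(r, r_cut)
        for r in distances
    )


# Synthetic distances (in angstrom): a tight "bonded" shell and a broader
# second shell. These numbers are invented for the example.
rng = random.Random(42)
distances = [rng.gauss(1.0, 0.05) for _ in range(300)]
distances += [rng.gauss(2.4, 0.15) for _ in range(300)]

weights, means, variances = fit_gmm_1d(distances, n_components=2)
params = acsf_params_from_gmm(means, variances)

# Feature vector for one hypothetical atomic environment.
neighbors = [0.98, 1.03, 2.35, 2.50]
features = [radial_acsf(neighbors, eta, r_s, r_cut=6.0) for eta, r_s in params]
```

In this sketch each cluster yields a single function centered on its mean. The abstract reports that the best models instead use two functions per cluster, covering its lower and upper tails; how those functions are positioned is not specified here, so that variant is left out.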

Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets

Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning

Splitting chemical structure data sets for federated privacy-preserving machine learning

Optimizing Training Data Set for the Machine Learning Potential of Li-Si Alloys via Structural Similarity-based Screening

Dataset Construction to Explore Chemical Space with 3D Geometry and Deep Learning

An Unsupervised Machine Learning Approach for the Automatic Construction of Local Chemical Descriptors

Learning to Group Auxiliary Datasets for Molecule

Machine learning of molecular properties: locality and active learning

Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Federated learning of molecular properties with graph neural networks in a heterogeneous setting

Designing compact training sets for data-driven molecular property prediction through optimal exploitation and exploration

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

Molecular set representation learning

Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS

Transfer Learning for Molecular Property Predictions from Small Data Sets

MoleculeNet: A Benchmark for Molecular Machine Learning

Systematic Evaluation of Local and Global Machine Learning Models for the Prediction of ADME Properties

Extracting Predictive Representations from Hundreds of Millions of Molecules

Small molecule machine learning: All models are wrong, some may not even be useful

MolSets: Molecular Graph Deep Sets Learning for Mixture Property Modeling