Abstract: This paper explores two quantum mechanics datasets (QM7b and QM9), each composed of several thousand organic molecules described in terms of electronic properties, using a combination of unsupervised and supervised machine learning methods. It has two objectives: (i) to understand the internal structure and characteristics of this kind of data, and (ii) to present machine‐learning‐based approaches to inverse molecular design that approximate the atomic composition of molecules from their physical‐chemical properties, using the QM9 data. Understanding the structure and characteristics of the data is important when predicting atomic composition from physical‐chemical properties in inverse molecular design. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that, for both datasets, the intrinsic dimensionality is several times smaller than the number of descriptive dimensions. The QM7b data is composed of well‐defined clusters related to atomic composition. The QM9 data consists of an outer region composed predominantly of outliers and an inner core region that concentrates clustered inliers. A significant relationship exists between the number of atoms in a molecule and its outlier/inlier nature, and the spatial structure is related to molecular weight. Despite the structural differences between the two datasets, the predictability of variables of interest for inverse molecular design is high.
This is exemplified by models that estimate the number of atoms in a molecule both from the original properties and from lower‐dimensional embedding spaces. In the generative approach, the input is a set of desired molecular properties and the output is an approximation of the atomic composition in terms of its constituent chemical elements. This approximation could serve as the starting region for further search in the huge space of possible chemical compounds. The study uses the QM9 quantum mechanics dataset, composed of 133,885 small organic molecules and 19 electronic properties. Different multi‐target regression approaches were considered for predicting the atomic composition from the properties, including feature engineering techniques within an automated machine learning framework. High‐quality models were found that predict the atomic composition of the molecules from their electronic properties, as well as from a subset of the properties only 52.6% of the original size. Feature selection worked better than feature generation. The results validate the generative approach to inverse molecular design.
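The property-to-composition mapping described above can be illustrated with a minimal sketch. It makes several assumptions: the abstract does not say which regressors were used, so k-nearest-neighbour averaging stands in here as one simple form of multi-target regression; the three "properties" are synthetic placeholders generated from the composition, not QM9 values; and the names `synthetic_properties` and `predict_composition` are hypothetical.

```python
from math import dist

# Element order of each composition vector (assumed for this sketch).
ELEMENTS = ("C", "H", "N", "O", "F")


def synthetic_properties(composition):
    """Toy stand-in for electronic properties. These are NOT real QM9 values."""
    c, h, n, o, f = composition
    return (
        1.0 * c + 0.1 * h + 1.2 * n + 1.4 * o + 1.6 * f,  # made-up property 1
        0.5 * c - 0.2 * h + 0.9 * n + 1.1 * o + 1.5 * f,  # made-up property 2
        float(c + h + n + o + f),                         # total number of atoms
    )


def predict_composition(query, train_x, train_y, k=3):
    """Multi-target k-NN regression: average the composition vectors
    of the k training molecules closest to `query` in property space."""
    order = sorted(range(len(train_x)), key=lambda i: dist(query, train_x[i]))
    nearest = order[:k]
    n_targets = len(train_y[0])
    return tuple(sum(train_y[i][t] for i in nearest) / k for t in range(n_targets))


# Small illustrative training set: atom counts (C, H, N, O, F) per molecule.
TRAIN_Y = [
    (1, 4, 0, 0, 0),  # CH4
    (2, 6, 0, 0, 0),  # C2H6
    (1, 4, 0, 1, 0),  # CH4O
    (2, 6, 0, 1, 0),  # C2H6O
    (1, 5, 1, 0, 0),  # CH5N
    (2, 4, 0, 2, 0),  # C2H4O2
    (1, 3, 0, 0, 1),  # CH3F
    (3, 8, 0, 0, 0),  # C3H8
]
TRAIN_X = [synthetic_properties(y) for y in TRAIN_Y]

# Desired properties (here taken from a known molecule so the result can be checked).
target = synthetic_properties((2, 6, 0, 1, 0))
exact = predict_composition(target, TRAIN_X, TRAIN_Y, k=1)
smoothed = predict_composition(target, TRAIN_X, TRAIN_Y, k=3)

print(dict(zip(ELEMENTS, exact)))
print(dict(zip(ELEMENTS, smoothed)))
```

With `k=1` the sketch returns the composition of the single closest training molecule. Larger `k` yields fractional atom counts, which fits the idea of an approximate starting region for further search rather than a final answer.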
