Systematic Improvement of Molecular Representations for Machine Learning Models

Bing Huang, O. Anatole von Lilienfeld
2016-01-01
Abstract: The predictive accuracy of Machine Learning (ML) models of molecular properties depends on the choice of the molecular representation. We introduce a hierarchy of representations based on uniqueness and target similarity criteria. To systematically control target similarity, we rely on interatomic many-body expansions including Bonding, Angular, and higher-order terms (abbreviated BA). Adding higher-order contributions systematically increases both the similarity to the potential energy function and the predictive accuracy of the resulting ML models. Numerical evidence is presented for the performance of BAML models trained on molecular properties pre-calculated at the electron-correlated and density functional levels of theory for thousands of small organic molecules. Properties studied include enthalpies and free energies of atomization, heat capacity, zero-point vibrational energies, dipole moment, polarizability, HOMO/LUMO energies and gap, ionization potential, electron affinity, and electronic excitations. After training, BAML enables predictions of energies or electronic properties of out-of-sample molecules with unprecedented accuracy and speed.
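For orientation, a generic interatomic many-body expansion of the kind the abstract refers to can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the paper: $d_{IJ}$ denotes the distance between atoms $I$ and $J$, $\theta_{IJK}$ the angle spanned by atoms $I$, $J$, $K$, and $E^{(n)}$ an $n$-body contribution. The exact functional forms used in BAML are not given in the abstract.

```latex
% Generic many-body expansion (illustrative notation, not the paper's):
% one-body (atomic), two-body (bonding), three-body (angular), and higher-order terms.
E(\{\mathbf{R}_I\}) \approx
    \sum_{I} E^{(1)}_{I}
  + \sum_{I<J} E^{(2)}(d_{IJ})
  + \sum_{I<J<K} E^{(3)}(d_{IJ}, d_{JK}, \theta_{IJK})
  + \dots
```

Truncating after the two-body sum keeps only bonding-type terms; including the three-body sum adds angular information, which is the sense in which higher-order terms bring the representation closer to the potential energy function.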