Informative Training Data for Efficient Property Prediction in Metal-Organic Frameworks by Active Learning

Roberta Poloni, Ashna Jose, Emilie Devijver, Noel Jakse
DOI: https://doi.org/10.26434/chemrxiv-2023-sw9kv
2023-12-01
Abstract: In recent data-driven approaches to materials discovery, scenarios where target quantities are expensive to compute or measure are often overlooked. In such cases, it becomes imperative to construct a training set that includes the most diverse, representative, and informative samples. Here, a novel regression tree-based active learning algorithm is employed for this purpose. It is applied to predict the band gap and adsorption properties of metal-organic frameworks (MOFs), a novel class of materials that results from the virtually infinite combinations of their building units. Simpler, low-dimensional descriptors, such as the Stoichiometric-120 and geometric properties, are found here to better represent MOFs in the low-data regime and are used to compute the feature space for this model. The partition given by a regression tree constructed on the labeled part of the dataset is used to select new samples to be added to the training set, thereby limiting its size while maximizing the prediction quality. Through tests on the QMOF, hMOF, and dMOF data sets, we show that our method is effective in constructing small training data sets for regression models that accurately predict the target properties, thus reducing the labeling cost. Specifically, our active learning approach is highly beneficial when labels are unevenly distributed in the descriptor space and when the label distribution is imbalanced, which is often the case for real-world data. This offers a unique tool to efficiently analyze complex structure-property relationships in materials and accelerate materials discovery.
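The selection step described in the abstract (a regression tree fitted on the labeled samples partitions the descriptor space, and that partition decides which unlabeled samples to label next) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (build_tree, find_leaf, select_batch), the toy oracle, the random 2D data standing in for MOF descriptors, and in particular the leaf weighting (label variance times number of unlabeled points in the leaf) are all assumptions made for illustration.

```python
import random
from statistics import pvariance

def build_tree(X, y, min_leaf=3, max_depth=3, depth=0):
    """Fit a small regression tree by greedy variance-reducing splits."""
    best = None
    if depth < max_depth and len(y) >= 2 * min_leaf:
        for j in range(len(X[0])):
            for t in sorted({x[j] for x in X}):
                lo = [i for i, x in enumerate(X) if x[j] <= t]
                hi = [i for i, x in enumerate(X) if x[j] > t]
                if len(lo) < min_leaf or len(hi) < min_leaf:
                    continue
                # Size-weighted child variance equals the sum of squared errors.
                sse = sum(len(s) * pvariance([y[i] for i in s]) for s in (lo, hi))
                if best is None or sse < best[0]:
                    best = (sse, j, t, lo, hi)
    if best is None:
        return {"var": pvariance(y)}  # leaf: keep the label variance
    _, j, t, lo, hi = best
    child = lambda idx: build_tree([X[i] for i in idx], [y[i] for i in idx],
                                   min_leaf, max_depth, depth + 1)
    return {"feat": j, "thr": t, "lo": child(lo), "hi": child(hi)}

def find_leaf(node, x):
    """Descend from the root to the leaf whose region contains x."""
    while "feat" in node:
        node = node["lo"] if x[node["feat"]] <= node["thr"] else node["hi"]
    return node

def select_batch(tree, X_pool, batch, rng):
    """Choose pool indices, favouring leaves with high label variance and
    many unlabeled points (assumed weighting, not taken from the paper)."""
    groups = {}
    for i, x in enumerate(X_pool):
        leaf = find_leaf(tree, x)
        groups.setdefault(id(leaf), (leaf, []))[1].append(i)
    taken = {k: 0 for k in groups}
    for _ in range(min(batch, len(X_pool))):
        open_leaves = [k for k, (_, idx) in groups.items() if taken[k] < len(idx)]
        # Greedy proportional allocation; the epsilon keeps zero-variance leaves eligible.
        k = max(open_leaves,
                key=lambda k: (groups[k][0]["var"] + 1e-12) * len(groups[k][1]) / (taken[k] + 1))
        taken[k] += 1
    chosen = []
    for k, (_, idx) in groups.items():
        chosen += rng.sample(idx, taken[k])  # random pick inside each leaf
    return chosen

# Toy data: 2D points with an imbalanced label (large only where x[0] > 0.8).
rng = random.Random(0)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(200)]
oracle = lambda x: 10.0 * x[0] if x[0] > 0.8 else 0.1 * x[1]  # stand-in for a costly calculation
labeled = rng.sample(range(200), 20)
for _ in range(3):  # three active learning rounds of 10 new labels each
    tree = build_tree([X[i] for i in labeled], [oracle(X[i]) for i in labeled])
    pool = [i for i in range(200) if i not in labeled]
    picks = select_batch(tree, [X[i] for i in pool], 10, rng)
    labeled += [pool[p] for p in picks]
print(len(labeled), len(set(labeled)))
```

In this sketch the labeling budget is spent preferentially in regions where the current labels vary most, which is one plausible way to handle the unevenly distributed and imbalanced labels the abstract mentions. A real application would replace the toy oracle with the expensive calculation or measurement and the random points with the chosen MOF descriptors.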
Chemistry