Alarming structural error rates in MOF databases used in data driven workflows identified via a novel metal oxidation state-based method

Tom Woo, Andrew White, Jake Burner, Marco Gibaldi, R. Alex Mayo
DOI: https://doi.org/10.26434/chemrxiv-2024-ftsv3
2024-10-10
Abstract: Metal-organic frameworks (MOFs) are a diverse class of porous materials composed of inorganic nodes joined by organic linkers, currently under investigation for a wide range of applications including gas storage and separation where they have been commercialized. Given the labor-intensive nature of synthesizing and testing individual MOFs, high-throughput computational screening and machine learning (ML) methods are increasingly viewed as essential for facilitating MOF development. However, the structural fidelity of the “computation-ready” MOF databases used in such studies remains largely unquantified. We introduce MOSAEC, an algorithm that detects chemically invalid structures on the basis of metal oxidation states. MOSAEC was manually validated against ~16k MOF structures from the popular CoRE database, and was found to flag erroneous structures with 95% accuracy. Systematic examination of 14 leading experimental and hypothetical MOF databases containing >1.9 million MOFs reveals concerning structural error rates, exceeding 40% in most cases.
Chemistry
What problem does this paper attempt to address?
This paper addresses the high structural error rates in metal-organic framework (MOF) databases. Specifically, it points out that the structural fidelity of the MOF databases used in data-driven workflows remains largely unquantified. The authors introduce MOSAEC (Metal Oxidation State Automated Error Checker), an algorithm that detects chemically invalid structures on the basis of metal oxidation states. In a manual validation against approximately 16,000 MOF structures from the popular CoRE database, MOSAEC flagged erroneous structures with 95% accuracy. A systematic examination of 14 leading experimental and hypothetical MOF databases, together containing more than 1.9 million MOFs, shows structural error rates exceeding 40% in most cases. This finding calls into question the large number of studies that rely on these "computation-ready" databases and underscores the risk of using high-error-rate databases unmodified in downstream computational studies.
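The general idea of using metal oxidation states as a validity check can be illustrated with a toy sketch. This is not the MOSAEC implementation, whose details are not given here. It assumes the oxidation state of each metal site has already been computed by some other means, and it uses a small, hypothetical reference table (`PLAUSIBLE_STATES`) and a hypothetical function name (`flag_structure`) purely for illustration. A metal whose computed state is non-integer, or falls outside the states commonly observed for that element, is treated as a hint that the structure may be chemically invalid.

```python
# Toy illustration only: NOT the MOSAEC algorithm from the paper.
# PLAUSIBLE_STATES is a hypothetical, deliberately incomplete table of
# commonly observed oxidation states for a few metals.
PLAUSIBLE_STATES = {
    "Zn": {2},
    "Cu": {1, 2},
    "Fe": {2, 3},
    "Zr": {4},
}


def flag_structure(metal_states):
    """Return (metal, state, reason) tuples for suspicious metal sites.

    metal_states: iterable of (element symbol, computed oxidation state).
    The computed states are assumed to come from an upstream method.
    """
    flags = []
    for metal, state in metal_states:
        allowed = PLAUSIBLE_STATES.get(metal)
        if allowed is None:
            continue  # no reference data for this element; cannot judge
        if abs(state - round(state)) > 1e-6:
            flags.append((metal, state, "non-integer"))
        elif round(state) not in allowed:
            flags.append((metal, state, "implausible"))
    return flags


if __name__ == "__main__":
    # A Zn site computed as +4 is flagged; Cu(+2) passes.
    print(flag_structure([("Zn", 4), ("Cu", 2)]))
```

In this sketch a structure with an empty flag list passes the check, and any non-empty list marks it for closer inspection. A real tool would need a far more complete reference table and a reliable way to compute the oxidation states in the first place.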