A systematic and general machine learning approach to build a consistent data set from different experiments

Itamar Borges, Matheus Máximo-Canadas, Julio Cesar Duarte, Jakler Nichele, Leonardo Alves, Rogerio Ramos, Luiz Octavio Pereira
DOI: https://doi.org/10.26434/chemrxiv-2024-g6bsl
2024-07-10
Abstract: Experimental data from different sources present challenges due to variability and noise from various experimental conditions, apparatuses, and environmental factors. In this work, we propose a general method to address these challenges and build a consistent data set, employing different experimental thermal conductivity data sets of methane from the liquid, vapor, and supercritical phases. Methane is a key hydrocarbon with extensive industrial and environmental applications. The method is based on machine learning (ML) techniques, which are used to consistently integrate data from various experimental sources compiled by the National Institute of Standards and Technology (NIST) database. Different ML algorithms are used for this purpose. Our findings indicate that ML models trained on raw experimental data yield predictions closer to NIST's processed data than the original raw experimental data, thus demonstrating the models' ability to generalize from heterogeneous, noisy, and untreated data sets. The proposed ML approach is general and efficient in handling complex and heterogeneous data to deliver reliable predictions without extensive preprocessing.
Chemistry
What problem does this paper attempt to address?
This paper investigates how machine learning (ML) methods can be used to build a consistent data set from thermal conductivity measurements reported by different experiments. The authors focus on methane, a hydrocarbon of major industrial and environmental importance, in its liquid, vapor, and supercritical phases. Because measurements from different sources vary with experimental conditions, apparatus, and environmental factors, and therefore carry noise, the authors propose a general ML-based approach for integrating the experimental data compiled in the National Institute of Standards and Technology (NIST) database.
Several ML algorithms are trained on the raw experimental data. In addition, a decision tree model classifies the physical state of each experimental data point (liquid, vapor, or supercritical) from its temperature and pressure.
The results show that the ML predictions agree more closely with the NIST-processed data than the raw experimental data do, which indicates that the models learn the underlying trends from heterogeneous, untreated data sets. This suggests that the method is effective and reliable for handling complex, heterogeneous data without extensive preprocessing. It also offers a route to consistent thermal property data sets for applications that require accurate thermal management, such as those involving methane.
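The phase-labeling step mentioned above (a decision tree that maps temperature and pressure to liquid, vapor, or supercritical) can be illustrated with a minimal sketch. It is not the authors' implementation: the tree below is a small Gini-based learner written in plain Python, and the names `POINTS`, `LABELS`, `build_tree`, and `predict` are hypothetical. The training points are toy values placed around methane's critical point (approximately 190.6 K and 4.6 MPa) for illustration only; they are not taken from the paper or from NIST.

```python
from collections import Counter

# Toy (temperature in K, pressure in MPa) points with assumed phase labels.
# Illustrative only: these are NOT measurements from the paper or NIST.
POINTS = [
    (100.0, 1.0), (120.0, 5.0), (150.0, 10.0), (180.0, 5.0),
    (300.0, 0.1), (250.0, 1.0), (200.0, 2.0), (150.0, 0.5),
    (300.0, 10.0), (250.0, 20.0), (200.0, 6.0), (400.0, 5.0),
]
LABELS = ["liquid"] * 4 + ["vapor"] * 4 + ["supercritical"] * 4


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, feature, threshold) with the lowest weighted Gini."""
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def build_tree(rows, labels):
    """Grow a tree until every leaf is pure (or no split is possible)."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:  # identical feature vectors: fall back to majority label
        return Counter(labels).most_common(1)[0][0]
    _, feature, threshold = split
    left = [(row, y) for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [(row, y) for row, y in zip(rows, labels) if row[feature] > threshold]
    return (
        feature,
        threshold,
        build_tree([r for r, _ in left], [y for _, y in left]),
        build_tree([r for r, _ in right], [y for _, y in right]),
    )


def predict(tree, row):
    """Walk from the root to a leaf; leaves are plain label strings."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


tree = build_tree(POINTS, LABELS)
print(predict(tree, (320.0, 15.0)))  # label for an unseen (T, P) point
```

With real data, the labels would come from the data source's own phase assignments, and a library implementation of decision trees would normally replace this hand-written learner.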
In summary, the paper addresses how to use machine learning to build a consistent and reliable thermal conductivity data set from experimental data from different sources, particularly for an important substance such as methane, which matters for industrial process optimization, energy efficiency improvement, and environmental impact assessment.