Machine Learning in Complex Organic Mixtures: Applying Domain Knowledge Allows for Meaningful Performance with Small Datasets.

Jeffrey F. Van Humbeck, Katelyn Le, Jagoš R. Radović, Justin L. MacCallum, Stephen R. Larter
DOI: https://doi.org/10.26434/chemrxiv-2024-pz45l
2024-04-18
Abstract: The ability to quantify individual components of complex mixtures is a challenge found throughout the life and physical sciences. An improved capacity to generate large datasets along with the uptake of machine-learning (ML) based analysis tools has allowed for various ‘omics’ disciplines to realize exceptional advances. Other areas of chemistry that deal with complex mixtures often cannot leverage these advances. Environmental samples, for example, can be more difficult to access and the resulting small datasets are less appropriate for unconstrained ML approaches. Herein, we present an approach to address this latter issue. Using a very small environmental dataset (35 high-resolution mass spectra gathered from various solvent extractions of Canadian petroleum fractions), we show that the application of specific domain knowledge can lead to ML models with notable performance.
Chemistry
What problem does this paper attempt to address?
This paper examines how machine learning (ML) can deliver meaningful performance in the analysis of complex organic mixtures when only a small dataset is available. Environmental samples are often difficult to obtain, and the small datasets that result are poorly suited to unconstrained ML approaches. The paper shows that applying domain-specific knowledge can yield ML models with notable performance, using asphaltenes, a complex mixture found in petroleum samples, as the test case. Asphaltenes tend to form deposits, which causes problems in petroleum processing and transportation. Their molecular analysis is challenging because of their complex composition, which includes polycyclic aromatic structures and functional groups.
The authors analyze solvent extracts of two Canadian petroleum samples by high-resolution mass spectrometry (HRMS), producing a dataset of only 35 spectra. They find that knowledge of specific molecular relationships can guide the construction of ML models that predict variations across the solvent extraction process. Their "masked ion" strategy trains a lightweight neural network to predict the intensity of a hidden ion from the observed intensities of ions related to it by known molecular relationships. This constrains the model to chemically meaningful relationships instead of letting it search the dataset for spurious correlations.
The results show that ML combined with domain knowledge predicts the behavior of complex mixtures with good accuracy. The paper concludes that the approach is not limited to petroleum chemistry and may benefit other domains with complex HRMS datasets and explicit chemical relationships, such as the analysis of marine dissolved organic matter. Overall, it emphasizes the importance of combining domain knowledge with ML analysis when data are limited.
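
To make the "masked ion" idea concrete, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions that are not stated in the summary above: ions are arranged on a hypothetical grid indexed by carbon number and double-bond equivalents (DBE), so that grid neighbors differ by CH2 or H2; the spectra are synthetic, with 35 generated only to mirror the dataset size; and a single linear unit trained by gradient descent stands in for the lightweight neural network. All function names are illustrative.

```python
import math
import random

def synthetic_spectrum(rng, n_c=10, n_d=6):
    """Toy intensity grid indexed by (carbon number, DBE): a smooth peak plus noise.
    Synthetic stand-in data, not real HRMS measurements."""
    c0, d0 = rng.uniform(3, 6), rng.uniform(1.5, 3.5)
    return [[math.exp(-((c - c0) ** 2) / 8 - ((d - d0) ** 2) / 4) + rng.gauss(0, 0.01)
             for d in range(n_d)] for c in range(n_c)]

def masked_examples(spectra):
    """Hide each interior ion in turn; the features are its four grid neighbors
    (assumed here to be the +/-CH2 and +/-H2 relatives) plus a constant bias term."""
    X, y = [], []
    for g in spectra:
        for c in range(1, len(g) - 1):
            for d in range(1, len(g[0]) - 1):
                X.append([g[c - 1][d], g[c + 1][d], g[c][d - 1], g[c][d + 1], 1.0])
                y.append(g[c][d])  # the masked intensity is the prediction target
    return X, y

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train(X, y, lr=0.1, epochs=300):
    """Batch gradient descent on half the mean squared error of a linear unit."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            err = predict(w, x) - t
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

rng = random.Random(0)
spectra = [synthetic_spectrum(rng) for _ in range(35)]
# Split by spectrum so every test ion comes from a spectrum unseen during training.
X_train, y_train = masked_examples(spectra[:28])
X_test, y_test = masked_examples(spectra[28:])

w = train(X_train, y_train)
model_rmse = rmse([predict(w, x) for x in X_test], y_test)
mean_y = sum(y_train) / len(y_train)
baseline_rmse = rmse([mean_y] * len(y_test), y_test)
print(f"masked-ion RMSE: {model_rmse:.3f}  mean-baseline RMSE: {baseline_rmse:.3f}")
```

The point of the design is that the model only ever sees chemically related neighbors of the hidden ion, so even a handful of spectra yields hundreds of training examples and far fewer free parameters than a model fed whole spectra. Comparing against the mean-intensity baseline gives a rough check that the neighbor relationships carry predictive signal in this toy setting.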