NLP meets Materials Science: Quantifying the presentation of materials data in scientific literature

Hasan M. Sayeed, Wade Smallwood, Sterling G. Baird, Taylor D. Sparks
DOI: https://doi.org/10.26434/chemrxiv-2023-wd5cr-v3
2023-12-29
Abstract: The recent and sudden emergence of Large Language Models has profoundly changed the landscape around how we approach and interact with information. Materials Science, given its highly complex and multifaceted nature, is a space we intend for Natural Language Processing to absolutely flip the script related to progress, learning, and especially new materials discovery, as a result of enhanced data accessibility. We explore the underlying patterns and structures of data expression across a number of randomly selected materials science papers, annotating relevant data by type (category) and source (channel) as a starting point to future Materials Science specific information extraction and LLM development.
Chemistry
What problem does this paper attempt to address?
This paper addresses how natural language processing (NLP) can improve data extraction and access in materials science. Materials science lags behind other physical sciences in data-driven discovery, even though a large amount of information on material composition, preparation conditions, and performance attributes is embedded in the academic literature. That information is scattered across text, tables, and graphs, which makes it hard to extract and analyze. By studying the patterns and structures of data representation in materials science papers, the authors aim to lay the groundwork for information extraction tools and large language models (LLMs) tailored to materials science, with the goal of improving data accessibility and speeding up new materials discovery. The methodology consists of manually annotating the data in randomly selected materials science papers by type (category) and source (channel), in order to reveal how data sources relate to one another and to inform more effective extraction methods.
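The annotation scheme described above (each data item labeled with a category and a channel) can be sketched as a small data structure plus a tally. This is a minimal illustration, not the authors' tooling: the `Annotation` class, the `tally` and `channel_share` helpers, the label strings, and the sample records are all hypothetical placeholders rather than data from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated data item (hypothetical schema, not from the paper)."""
    paper_id: str
    category: str  # data type, e.g. "composition", "processing", "property"
    channel: str   # source within the paper, e.g. "text", "table", "figure"


def tally(annotations):
    """Count annotations for each (category, channel) pair."""
    return Counter((a.category, a.channel) for a in annotations)


def channel_share(annotations, category):
    """Fraction of one category's annotations found in each channel."""
    counts = Counter(a.channel for a in annotations if a.category == category)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {channel: n / total for channel, n in counts.items()}


# Placeholder records for illustration only; these are not results.
annotations = [
    Annotation("paper-1", "composition", "text"),
    Annotation("paper-1", "composition", "table"),
    Annotation("paper-1", "property", "figure"),
    Annotation("paper-2", "processing", "text"),
    Annotation("paper-2", "property", "table"),
    Annotation("paper-2", "property", "figure"),
]

if __name__ == "__main__":
    for (category, channel), n in sorted(tally(annotations).items()):
        print(f"{category:12s} {channel:8s} {n}")
    print(channel_share(annotations, "property"))
```

Counting by (category, channel) pair is one simple way to see which kinds of data tend to appear in which part of a paper, which is the sort of relationship the annotation effort is meant to expose.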