Abstract: The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts; when used as the encoder for text, it outperforms other baseline models in three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org, which can be used to conveniently locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.
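
For a sense of scale, the rounded figures quoted in the abstract can be turned into rough rates. The following is a back-of-the-envelope sketch only: it assumes the 60 hours were spent on exactly the ~130,000 abstracts that yielded records, which the abstract does not specify, so the per-abstract time is at best an upper bound. The variable names are illustrative.

```python
# Rounded figures quoted in the abstract (all approximate).
records = 300_000    # extracted material property records
abstracts = 130_000  # abstracts that yielded those records
hours = 60           # reported wall-clock time

# Assumption: the 60 hours cover exactly these abstracts.
records_per_abstract = records / abstracts
abstracts_per_hour = abstracts / hours
seconds_per_abstract = hours * 3600 / abstracts

print(f"records per abstract: {records_per_abstract:.2f}")
print(f"abstracts per hour:   {abstracts_per_hour:.0f}")
print(f"seconds per abstract: {seconds_per_abstract:.2f}")
```

Under that assumption, the figures work out to roughly 2.3 records per abstract and under 2 seconds per abstract.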

What problem does this paper attempt to address?

This paper addresses the difficulty of extracting material property data from the materials science literature, particularly for polymers. As the number of materials science articles grows year over year, it becomes increasingly difficult to extract chemistry-structure-property relationships from the literature. Manual extraction is time-consuming and cannot keep pace with research needs.

### Specific Problems and Solutions

1. **Problem Description**:
   - The materials science literature grows at a compound annual growth rate of 6%.
   - Much material property information is locked in the literature as natural language text and cannot be directly read or analyzed by machines.
   - Searching for material systems with specific properties becomes harder as a result.
   - Because machine-readable data is lacking, materials informatics suffers from data scarcity, and training property prediction models requires extensive manual data curation.
2. **Solution**:
   - Use natural language processing (NLP) to automatically extract material property data from the literature.
   - Build a general-purpose material property data extraction pipeline that can handle large polymer corpora.
   - Train a language model, MaterialsBERT, on 2.4 million materials science abstracts and use it as the text encoder for the named entity recognition (NER) task.
   - With this pipeline, extract about 300,000 material property records from about 130,000 abstracts in 60 hours.

### Main Contributions

1. **Constructing a General Pipeline**:
   - Propose an automated pipeline that starts from published literature and ends with a complete set of extracted material property information.
   - The pipeline is presented as the first general data extraction pipeline applicable to any material property.
2. **Improving Extraction Efficiency**:
   - Use MaterialsBERT as the encoder for the NER task, where it is compared against similar models such as BioBERT and ChemBERT.
   - It outperforms the other baseline language models on three of the five NER datasets evaluated.
3. **Application and Verification**:
   - The extracted data covers a variety of applications, such as fuel cells, supercapacitors, and polymer solar cells.
   - Analysis of the extracted data reproduces multiple known materials science trends and phenomena, which supports the validity of the method.
4. **Providing a Convenient Tool**:
   - Provide a web-based interface (https://polymerscholar.org) that helps researchers find material property data of interest.

### Conclusion

This paper shows how NLP can be used to extract material property data from the materials science literature, providing new ideas and tools for the development of materials informatics.
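
The extraction step described in the summary, in which an NER model tags tokens and the tagged entities become material property records, can be illustrated with a minimal sketch. This is not the authors' implementation: the tag names (`MATERIAL`, `PROPERTY`, `VALUE`), the BIO tagging scheme, the helper functions, and the example sentence are all assumptions made for illustration, and the tags are hard-coded rather than predicted by MaterialsBERT or any other model.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    label: str  # hypothetical entity type, e.g. "MATERIAL"
    text: str   # surface text of the entity span


def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into entity spans."""
    entities = []
    label, words = None, []

    def flush():
        # Close the span currently being built, if any.
        nonlocal label, words
        if label is not None:
            entities.append(Entity(label, " ".join(words)))
        label, words = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and tag[2:] == label:
            words.append(token)
        else:
            # "O" tags and stray "I-" tags end the current span.
            flush()
    flush()
    return entities


def to_record(entities):
    """Naively pair the first material, property, and value found."""
    first = {}
    for entity in entities:
        first.setdefault(entity.label, entity.text)
    if {"MATERIAL", "PROPERTY", "VALUE"} <= first.keys():
        return {
            "material": first["MATERIAL"],
            "property": first["PROPERTY"],
            "value": first["VALUE"],
        }
    return None  # sentence does not contain a complete record


# Illustrative sentence with hand-written tags (not model output).
tokens = ["Polystyrene", "has", "a", "glass", "transition",
          "temperature", "of", "100", "°C", "."]
tags = ["B-MATERIAL", "O", "O", "B-PROPERTY", "I-PROPERTY",
        "I-PROPERTY", "O", "B-VALUE", "I-VALUE", "O"]

record = to_record(decode_bio(tokens, tags))
print(record)
```

For the example sentence, the sketch pairs Polystyrene with the property "glass transition temperature" and the value "100 °C". A real pipeline would need more careful logic for sentences that mention several materials or properties; taking the first entity of each type is only the simplest possible heuristic.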

A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing
