A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

Pranav Shetty, Arunkumar Chitteth Rajan, Christopher Kuenneth, Sonakshi Gupta, Lakshmi Prerana Panchumarti, Lauren Holm, Chao Zhang, Rampi Ramprasad
DOI: https://doi.org/10.1038/s41524-023-01003-w
2022-09-27
Abstract: The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, using 2.4 million materials science abstracts, which outperforms other baseline models in three out of five named entity recognition datasets when used as the encoder for text. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org which can be used to locate material property data recorded in abstracts conveniently. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.
Computation and Language, Materials Science, Soft Condensed Matter
What problem does this paper attempt to address?
This paper addresses the difficulty of extracting material property data from the materials science literature, particularly for polymers. As the number of materials science articles grows each year, it becomes increasingly difficult to infer chemistry-structure-property relationships from the published literature. Manual extraction is time-consuming and cannot keep pace with the needs of a rapidly developing field.

### Specific Problems and Solutions

1. **Problem Description**:
   - The materials science literature grows at a compound annual growth rate of 6%.
   - Much material property information is locked in the literature as natural language text and cannot be directly read or analyzed by machines.
   - Searching for material systems with specific properties becomes more difficult.
   - Because machine-readable data is lacking, materials informatics faces data scarcity, and training property prediction models requires a large amount of time for manual data curation.
2. **Solution**:
   - Use natural language processing (NLP) to automatically extract material property data from the literature.
   - Construct a general-purpose material property data extraction pipeline that can handle large polymer corpora.
   - Train a language model, MaterialsBERT, on 2.4 million materials science abstracts and use it as the text encoder for named entity recognition (NER).
   - With this pipeline, about 300,000 material property records are extracted from about 130,000 abstracts in 60 hours.

### Main Contributions

1. **Constructing a General Pipeline**:
   - Propose an automated process that starts from published literature and ends with a complete set of extracted material property information.
   - The pipeline is presented as the first general-purpose data extraction pipeline applicable to any material property.
2. **Improving Extraction Efficiency**:
   - Use MaterialsBERT as the encoder for the NER task, where it is compared against similar models such as BioBERT and ChemBERT.
   - It outperforms the other baseline language models on three of the five NER datasets evaluated.
3. **Application and Verification**:
   - The extracted data covers a variety of applications, such as fuel cells, supercapacitors, and polymer solar cells.
   - Analysis of the extracted data reproduces multiple known materials science trends and phenomena, which supports the validity of the method.
4. **Providing a Convenient Tool**:
   - Provide a web-based interface (https://polymerscholar.org) that helps researchers find material property data of interest.

### Conclusion

This paper shows how NLP can be used to extract material property data from the materials science literature, providing new ideas and tools for the development of materials informatics.
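
To make the NER-based extraction step more concrete, the following is a minimal Python sketch of how token-level tags might be turned into a material property record. It is an illustration under stated assumptions, not the authors' implementation: the label names (`MATERIAL`, `PROP_NAME`, `PROP_VALUE`), the BIO tagging scheme, the helper functions, and the nearest-entity pairing heuristic are all hypothetical, and the tags are hard-coded where a trained tagger (for example, one using MaterialsBERT as its encoder) would normally supply them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    label: str   # entity type, e.g. "MATERIAL" (hypothetical label set)
    text: str    # surface text of the entity
    start: int   # index of the entity's first token


@dataclass
class Record:
    material: str
    property_name: str
    value: str


def group_bio(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Merge BIO-tagged tokens into entities.

    B-X starts an entity of type X, I-X continues it, O is outside.
    An I-X tag that does not continue an open X entity is treated as O.
    """
    entities: List[Entity] = []
    label: Optional[str] = None
    words: List[str] = []
    start = 0

    def flush() -> None:
        if label is not None:
            entities.append(Entity(label, " ".join(words), start))

    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag.startswith("B-"):
            flush()
            label, words, start = tag[2:], [tok], i
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(tok)
        else:
            flush()
            label, words = None, []
    flush()
    return entities


def nearest(entities: List[Entity], label: str, position: int) -> Optional[Entity]:
    """Return the entity of the given type closest to a token position."""
    candidates = [e for e in entities if e.label == label]
    if not candidates:
        return None
    return min(candidates, key=lambda e: abs(e.start - position))


def extract_records(tokens: List[str], tags: List[str]) -> List[Record]:
    """Pair each property value with the nearest material and property name.

    This distance heuristic is an assumption made for illustration only.
    """
    entities = group_bio(tokens, tags)
    records: List[Record] = []
    for value in (e for e in entities if e.label == "PROP_VALUE"):
        material = nearest(entities, "MATERIAL", value.start)
        name = nearest(entities, "PROP_NAME", value.start)
        if material is not None and name is not None:
            records.append(Record(material.text, name.text, value.text))
    return records


# Hard-coded example standing in for the output of a trained NER model.
tokens = ["Polystyrene", "has", "a", "glass", "transition",
          "temperature", "of", "373", "K", "."]
tags = ["B-MATERIAL", "O", "O", "B-PROP_NAME", "I-PROP_NAME",
        "I-PROP_NAME", "O", "B-PROP_VALUE", "I-PROP_VALUE", "O"]

for record in extract_records(tokens, tags):
    print(record)
```

The sketch separates entity grouping from record assembly so that the pairing rule can be swapped out independently; a real pipeline would likely need more careful handling of abstracts that mention several materials and properties in one sentence.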