A GPT-assisted iterative method for extracting domain knowledge from a large volume of literature of electromagnetic wave absorbing materials with limited manually annotated data

Dongbo Dai, Guangjie Zhang, Xiao Wei, Yudian Lin, Mengmeng Dai, Junjie Peng, Na Song, Zheng Tang, Shengzhou Li, Jiwei Liu, Yan Xu, Renchao Che, Huiran Zhang
DOI: https://doi.org/10.1016/j.commatsci.2024.113431
IF: 3.572
2024-10-07
Computational Materials Science
Abstract: Research on electromagnetic wave absorbing materials is an important part of materials science. Each year, a substantial amount of academic literature is published in this field, containing critical information. Rapid and effective knowledge extraction from these documents is key to accelerating progress in the field, and automated knowledge extraction based on deep learning provides a solution to this challenge. However, deep learning models typically require extensive annotated data for training, which is time-consuming and expensive to obtain in highly specialized subfields. To address this issue, this paper presents a GPT-assisted iterative training method that uses only 30 manually annotated literature abstracts as a training set and ultimately achieves an F1 score of 82.94% for a named entity recognition (NER) model. The effectiveness of this model is demonstrated by comparing it with other large language models commonly used in materials science on a custom dataset. We constructed a knowledge extraction framework centered around the obtained NER model and collected literature on electromagnetic wave absorbing materials from the last decade. The extraction and application results demonstrate the practicality of our framework.
materials science, multidisciplinary