A materials terminology knowledge graph automatically constructed from text corpus

Yuwei Zhang, Fangyi Chen, Zeyi Liu, Yunzhuo Ju, Dongliang Cui, Jinyi Zhu, Xue Jiang, Xi Guo, Jie He, Lei Zhang, Xiaotong Zhang, Yanjing Su
DOI: https://doi.org/10.1038/s41597-024-03448-0
2024-06-08
Scientific Data
Abstract: A scalable, reusable, and broad-coverage unified material knowledge representation shows its importance and will bring great benefits to data sharing among materials communities. A knowledge graph (KG) for materials terminology, which is a formal collection of term entities and relationships, is conceptually important to achieve this goal. In this work, we propose a KG for materials terminology, named Materials Genome Engineering Database Knowledge Graph (MGED-KG), which is automatically constructed from text corpus via natural language processing. MGED-KG is the most comprehensive KG for materials terminology in both Chinese and English languages, consisting of 8,660 terms and their explanations. It encompasses 11 principal categories, such as Metals, Composites, Nanomaterials, each with two or three levels of subcategories, resulting in a total of 235 distinct category labels. For further application, a knowledge web system based on MGED-KG is developed and shows its great power in improving data sharing efficiency from the aspects of query expansion, term, and data recommendation.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses the lack of a unified, reusable representation of materials terminology, which hinders data sharing among materials science communities. The authors propose the Materials Genome Engineering Database Knowledge Graph (MGED-KG), a knowledge graph (KG) of term entities and their relationships that is constructed automatically from a text corpus using natural language processing. MGED-KG is intended to standardize how materials terms are represented and thereby make data sharing more efficient. It contains 8,660 terms with explanations in both Chinese and English, organized into 11 principal categories (for example Metals, Composites, and Nanomaterials). Each category has two or three levels of subcategories, for a total of 235 distinct category labels. The authors also built a knowledge web system on top of MGED-KG that supports query expansion, term recommendation, and data recommendation. According to this summary, these functions complete user input automatically and recommend terms and data instances related to the query, which helps users find relevant data faster.
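
To make the idea of query expansion over a terminology graph concrete, here is a minimal sketch in Python. It is not the MGED-KG implementation, which this page does not describe in detail. The class name `TermGraph`, the relation labels, and every term other than the category "Metals" are hypothetical and chosen only for illustration. The sketch assumes that expansion simply collects every term within a fixed number of hops of the query term.

```python
class TermGraph:
    """Toy terminology graph (hypothetical, not the MGED-KG data model)."""

    def __init__(self):
        # Maps each term to a set of (relation, neighbor) pairs.
        self.edges = {}

    def add_relation(self, head, relation, tail):
        # Store the edge in both directions so expansion can walk either way.
        self.edges.setdefault(head, set()).add((relation, tail))
        self.edges.setdefault(tail, set()).add((relation + "_inverse", head))

    def expand_query(self, term, max_hops=1):
        """Return the term plus all terms reachable within max_hops edges."""
        seen = {term}
        frontier = [term]
        for _ in range(max_hops):
            next_frontier = []
            for current in frontier:
                for _relation, neighbor in self.edges.get(current, ()):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return sorted(seen)


# Illustrative data only: these edges are assumptions, not taken from MGED-KG.
graph = TermGraph()
graph.add_relation("Metals", "has_subcategory", "Alloys")
graph.add_relation("Alloys", "has_term", "stainless steel")
graph.add_relation("stainless steel", "synonym", "inox steel")

expanded = graph.expand_query("Alloys", max_hops=2)
print(expanded)
```

A search front end could pass the expanded list to its retrieval backend instead of the single original term, so that records tagged with a synonym or a parent category are also matched. The hop limit keeps the expansion from drifting too far from what the user asked for.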