MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities

Ali Riza Durmaz, Akhil Thomas, Lokesh Mishra, Rachana Niranjan Murthy, Thomas Straub
2024-08-06
Abstract: While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
Computation and Language, Materials Science
What problem does this paper attempt to address?
The paper addresses text mining in the field of materials mechanics, with a particular focus on literature about material fatigue. The researchers have constructed a dataset named MaterioMiner and an associated materials mechanics ontology, which are used to extract detailed information such as material composition, processing steps, defect distribution, and properties from the literature. The main contributions of this dataset include:

1. **Fine-grained Annotation**: The dataset covers 179 distinct classes, with 2,191 entities manually annotated by three annotators across four publications.
2. **Ontology Integration**: The dataset is linked to a custom materials mechanics ontology, which describes concepts from the mechanics of materials domain and associates them with entities in the text corpus.
3. **Demonstrative Application**: The study fine-tunes pre-trained models to show that training named-entity recognition models on the dataset is feasible, and it explores the annotation consistency among the three annotators.

Additionally, the paper discusses how the dataset can support the training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data. The overall goal is to advance neurosymbolic AI in materials science, enabling effective extraction and analysis of composition, process, microstructure, and property information.
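To make the notion of annotation consistency among three annotators more concrete, the following is a minimal sketch of one common way to quantify it: pairwise Cohen's kappa over per-token labels. This is an illustrative assumption, not the metric or procedure confirmed by the paper. The function names, the rater names, and the toy labels (`Material`, `Process`, `Property`, `O`) are hypothetical and are not taken from the dataset or its ontology.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same sequence of tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty token sequence")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one and the same label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappa(annotations):
    """Kappa for every pair of raters; `annotations` maps rater name -> label list."""
    return {
        (r1, r2): cohen_kappa(annotations[r1], annotations[r2])
        for r1, r2 in combinations(sorted(annotations), 2)
    }


# Hypothetical toy annotations from three raters over the same six tokens.
toy_annotations = {
    "rater_1": ["Material", "O", "Process", "O", "Property", "O"],
    "rater_2": ["Material", "O", "Process", "O", "O", "O"],
    "rater_3": ["Material", "O", "O", "O", "Property", "O"],
}

if __name__ == "__main__":
    for pair, kappa in pairwise_kappa(toy_annotations).items():
        print(pair, round(kappa, 3))
```

Token-level kappa is only one possible choice. With 179 classes and multi-token entities, span-level agreement scores may be more informative, and the paper itself should be consulted for the measure actually used.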