MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities

Ali Riza Durmaz, Akhil Thomas, Lokesh Mishra, Rachana Niranjan Murthy, Thomas Straub

2024-08-06

Abstract: While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.

Computation and Language, Materials Science

What problem does this paper attempt to address?

The paper addresses text mining in the mechanics of materials domain, focusing in particular on literature about material fatigue. The researchers constructed a dataset named MaterioMiner and an associated materials mechanics ontology, which are used to extract detailed information such as material composition, processing procedures, defect distribution, and properties from the literature. The main contributions of this dataset include:

1. **Fine-grained Annotation**: The dataset covers 179 distinct classes, with 2,191 entities manually annotated and curated by three annotators across four publications.
2. **Ontology Integration**: The dataset is linked to a custom materials mechanics ontology, which describes concepts from the mechanics of materials domain and associates them with textual entities in the corpus.
3. **Demonstrative Application**: The study fine-tunes pre-trained models to show that training named-entity recognition models on the dataset is feasible, and it examines annotation consistency among the three annotators.

The paper also discusses how the dataset can support the training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data. The overall goal is to advance neurosymbolic AI in materials science by enabling effective extraction and analysis of composition, process, microstructure, and property information.
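The consistency check among the three annotators can be illustrated with a small sketch. The source does not say which agreement metric the authors used, so pairwise Cohen's kappa is an assumption here, and the rater names and token labels below are hypothetical examples rather than data from MaterioMiner.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    shared = count_a.keys() & count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in shared) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(ratings):
    """Kappa for every pair of raters; keys are sorted rater-name pairs."""
    return {
        (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(sorted(ratings), 2)
    }


# Hypothetical token-level labels ("O" marks a token outside any entity).
ratings = {
    "rater_a": ["Material", "O", "Process", "O", "Property", "O"],
    "rater_b": ["Material", "O", "Process", "O", "O", "O"],
    "rater_c": ["Material", "O", "O", "O", "Property", "O"],
}

scores = pairwise_kappa(ratings)
for pair, kappa in scores.items():
    print(pair, round(kappa, 3))
```

With 179 classes, many labels will be rare, so a per-class breakdown or a span-level F1 between raters may be more informative than a single kappa value; which of these fits best depends on how the annotations are stored.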

Matminer: an Open Source Toolkit for Materials Data Mining

MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain

Ontology-conformal recognition of materials entities using language models

Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

MatText: Do Language Models Need More than Text & Scale for Materials Modeling?

Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction

Agent-based Learning of Materials Datasets from Scientific Literature

DigiMOF: A Database of Metal-Organic Framework Synthesis Information Generated via Text Mining

Constructing a Semantic Data Model for the Field of Material Science through Ontology

From Text to Insight: Large Language Models for Materials Science Data Extraction

A novel digitalization approach for smart materials – ontology‐based access to data and models

MatNexus: A comprehensive text mining and analysis suite for materials discovery

MaScQA: Investigating Materials Science Knowledge of Large Language Models

Polymetis: Large Language Modeling for Multiple Material Domains

Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning

Introducing MAMBO: Materials And Molecules Basic Ontology

Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning