MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature

Gyeong Hoon Yi, Jiwoo Choi, Hyeongyun Song, Olivia Miano, Jaewoong Choi, Kihoon Bang, Byungju Lee, Seok Su Sohn, David Buttler, Anna Hiszpanski, Sang Soo Han, Donghun Kim

2024-06-08

Abstract: Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers exist in highly diverse forms; thus, rule-based extractions are an ineffective approach. To overcome this challenge, we present MaTableGPT, which is a GPT-based table data extractor from the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension and filtering hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieved an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for the learning methods of zero-shot, few-shot and fine-tuning, we present a Pareto-front mapping where the few-shot learning method was found to be the most balanced solution owing to both its high extraction accuracy (total F1 score > 95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). The statistical analyses conducted on the database generated by MaTableGPT revealed valuable insights into the distribution of the overpotential and elemental utilization across the reported catalysts in the water splitting literature.
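
As a rough illustration of the accuracy metric named in the abstract, the sketch below computes a standard F1 score between a set of extracted items and a hand-labeled reference set. The abstract does not define how the "total F1 score" is aggregated, so the set-based matching, the function name `f1_score`, and the example triples are all assumptions made for illustration and are not taken from the paper.

```python
def f1_score(extracted: set, reference: set) -> float:
    """Standard F1 between extracted items and a hand-labeled reference set."""
    true_positives = len(extracted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # share of extracted items that are correct
    recall = true_positives / len(reference)     # share of reference items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical (catalyst, property, value) triples; NOT data from the paper.
reference = {
    ("catalyst_A", "overpotential_mV", "250"),
    ("catalyst_A", "tafel_slope", "45"),
    ("catalyst_B", "overpotential_mV", "300"),
}
extracted = {
    ("catalyst_A", "overpotential_mV", "250"),
    ("catalyst_B", "overpotential_mV", "300"),
    ("catalyst_B", "tafel_slope", "60"),  # not in the reference: a false positive
}

score = f1_score(extracted, reference)
print(f"F1 = {score:.3f}")
```

Here two of the three extracted triples match the reference, so precision and recall are both 2/3, and so is the F1 score.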

Computation and Language

What problem does this paper attempt to address?

The paper addresses how to extract data efficiently from tables in the materials science literature. Because tables in these papers take highly diverse forms, rule-based extraction is ineffective. The paper introduces MaTableGPT, a GPT-based table data extractor designed for this literature. MaTableGPT uses table data representation and table splitting strategies to improve GPT's comprehension of tables, and it filters out hallucinated information through follow-up questions. Applied to a large body of water splitting catalysis literature, MaTableGPT achieves an extraction accuracy (total F1 score) of up to 96.8%. The paper also evaluates GPT usage cost, labeling cost, and extraction accuracy for zero-shot, few-shot, and fine-tuning approaches, and finds that few-shot learning is the most balanced solution, combining high accuracy (total F1 score > 95%) with low cost (GPT usage cost of $5.97 and a labeling cost of 10 I/O paired examples). Statistical analysis of the database generated by MaTableGPT reveals the distribution of overpotential and elemental utilization across the catalysts reported in the water splitting literature.
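
The cost and accuracy comparison of zero-shot, few-shot, and fine-tuning can be read as a multi-objective selection problem. The sketch below shows one generic way to find the non-dominated (Pareto-optimal) options over GPT usage cost, labeling cost, and F1 score. It is a minimal illustration, not the authors' procedure: the `Method` type, the `pareto_front` function, and every number in `candidates` are placeholders invented for the example.

```python
from typing import List, NamedTuple


class Method(NamedTuple):
    name: str
    gpt_cost: float  # GPT usage cost, lower is better
    labels: int      # number of labeled I/O examples, lower is better
    f1: float        # extraction accuracy, higher is better


def dominates(a: Method, b: Method) -> bool:
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    no_worse = a.gpt_cost <= b.gpt_cost and a.labels <= b.labels and a.f1 >= b.f1
    better = a.gpt_cost < b.gpt_cost or a.labels < b.labels or a.f1 > b.f1
    return no_worse and better


def pareto_front(methods: List[Method]) -> List[Method]:
    """Return the methods that no other method dominates, sorted by GPT cost."""
    front = [m for m in methods if not any(dominates(o, m) for o in methods)]
    return sorted(front, key=lambda m: m.gpt_cost)


# Placeholder values for illustration only; NOT results from the paper.
candidates = [
    Method("option_A", gpt_cost=1.0, labels=0, f1=0.80),
    Method("option_B", gpt_cost=6.0, labels=10, f1=0.95),
    Method("option_C", gpt_cost=8.0, labels=10, f1=0.90),  # dominated by option_B
    Method("option_D", gpt_cost=50.0, labels=1000, f1=0.97),
]

front = pareto_front(candidates)
for m in front:
    print(m.name, m.gpt_cost, m.labels, m.f1)
```

In this toy input, `option_C` is dropped because `option_B` is cheaper and more accurate with the same labeling effort. The remaining three options each trade cost against accuracy, which is the kind of trade-off a Pareto front makes visible.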
