MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature

Gyeong Hoon Yi, Jiwoo Choi, Hyeongyun Song, Olivia Miano, Jaewoong Choi, Kihoon Bang, Byungju Lee, Seok Su Sohn, David Buttler, Anna Hiszpanski, Sang Soo Han, Donghun Kim
2024-06-08
Abstract: Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers exist in highly diverse forms; thus, rule-based extractions are an ineffective approach. To overcome this challenge, we present MaTableGPT, which is a GPT-based table data extractor from the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension and filtering hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieved an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for the learning methods of zero-shot, few-shot and fine-tuning, we present a Pareto-front mapping where the few-shot learning method was found to be the most balanced solution owing to both its high extraction accuracy (total F1 score > 95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). The statistical analyses conducted on the database generated by MaTableGPT revealed valuable insights into the distribution of the overpotential and elemental utilization across the reported catalysts in the water splitting literature.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to efficiently extract data from tables in the materials science literature. Because these tables appear in highly diverse formats, rule-based extraction is ineffective. The paper introduces MaTableGPT, a GPT-based table data extractor designed for tables in materials science papers. MaTableGPT uses table data representation and table splitting strategies to improve GPT's comprehension of tables, and it filters out hallucinated information through follow-up questions. When applied to a large volume of water splitting catalysis literature, MaTableGPT achieves an extraction accuracy (total F1 score) of up to 96.8%. The paper also evaluates the GPT usage cost, labeling cost, and extraction accuracy of zero-shot, few-shot, and fine-tuning methods, and presents a Pareto-front mapping in which few-shot learning is the most balanced solution, combining high accuracy (total F1 score > 95%) with low cost (GPT usage cost of $5.97 and a labeling cost of 10 I/O paired examples). Statistical analysis of the database generated by MaTableGPT reveals the distribution of overpotential and elemental utilization across the catalysts reported in the water splitting literature.
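The Pareto-front comparison mentioned above can be illustrated with a minimal sketch. It assumes each candidate method is scored on the three axes named in the abstract (GPT usage cost, labeling cost, and F1 score), and it treats a method as dominated when another method is no worse on every axis and strictly better on at least one. The `Method` type, the `pareto_front` function, and every number in `candidates` are hypothetical placeholders chosen for illustration; they are not values or code from the paper.

```python
from typing import List, NamedTuple


class Method(NamedTuple):
    name: str
    usage_cost_usd: float  # lower is better
    labeled_examples: int  # lower is better
    f1: float              # higher is better


def dominates(a: Method, b: Method) -> bool:
    """Return True if `a` is at least as good as `b` on every axis
    and strictly better on at least one."""
    no_worse = (
        a.usage_cost_usd <= b.usage_cost_usd
        and a.labeled_examples <= b.labeled_examples
        and a.f1 >= b.f1
    )
    strictly_better = (
        a.usage_cost_usd < b.usage_cost_usd
        or a.labeled_examples < b.labeled_examples
        or a.f1 > b.f1
    )
    return no_worse and strictly_better


def pareto_front(methods: List[Method]) -> List[Method]:
    """Keep only the methods that no other method dominates."""
    front = [
        m for m in methods
        if not any(dominates(other, m) for other in methods)
    ]
    return sorted(front, key=lambda m: m.usage_cost_usd)


# Placeholder values for illustration only; NOT results from the paper.
candidates = [
    Method("a", usage_cost_usd=1.0, labeled_examples=0, f1=0.80),
    Method("b", usage_cost_usd=2.0, labeled_examples=10, f1=0.90),
    Method("c", usage_cost_usd=3.0, labeled_examples=10, f1=0.85),
    Method("d", usage_cost_usd=5.0, labeled_examples=100, f1=0.95),
]

front = pareto_front(candidates)
print([m.name for m in front])  # ['a', 'b', 'd']
```

In this toy input, `c` drops out because `b` is cheaper, needs no more labels, and scores higher. Which point on the remaining front counts as "most balanced" is a separate judgment that depends on how cost and accuracy are weighed.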