SciDaSynth: Interactive Structured Knowledge Extraction and Synthesis from Scientific Literature with Large Language Model

Xingbo Wang, Samantha L. Huey, Rui Sheng, Saurabh Mehta, Fei Wang

2024-04-22

Abstract: Extraction and synthesis of structured knowledge from extensive scientific literature are crucial for advancing and disseminating scientific progress. Although many existing systems facilitate literature review and digest, they struggle to process multimodal, varied, and inconsistent information within and across the literature into structured data. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that enables researchers to efficiently build structured knowledge bases from scientific literature at scale. The system automatically creates data tables to organize and summarize users' interested knowledge in literature via question-answering. Furthermore, it provides multi-level and multi-faceted exploration of the generated data tables, facilitating iterative validation, correction, and refinement. Our within-subjects study with researchers demonstrates the effectiveness and efficiency of SciDaSynth in constructing quality scientific knowledge bases. We further discuss the design implications for human-AI interaction tools for data extraction and structuring.

Human-Computer Interaction

What problem does this paper attempt to address?

The paper presents SciDaSynth, an interactive system that uses large language models (LLMs) to extract and synthesize structured knowledge from large volumes of scientific literature. The core problem is converting unstructured knowledge in the literature into structured data: existing systems struggle with multimodal, varied, and inconsistent information within and across papers. SciDaSynth automatically builds data tables through question answering to organize and summarize the knowledge users are interested in. It also supports multi-level, multi-faceted exploration of the generated tables for iterative validation, correction, and refinement, so that researchers can supervise and correct the LLM output to improve the accuracy and reliability of the extracted knowledge. A within-subjects study with researchers demonstrated the effectiveness and efficiency of SciDaSynth in constructing quality scientific knowledge bases. The paper also discusses design implications for human-AI interaction tools for data extraction and structuring.

ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity

ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models

LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis

SynAsk: Unleashing the Power of Large Language Models in Organic Synthesis

Large Language Models for Scientific Synthesis, Inference and Explanation

Large Language Models for Scientific Information Extraction: An Empirical Study for Virology

Structured information extraction from complex scientific text with fine-tuned large language models

Scientific Large Language Models: A Survey on Biological & Chemical Domains

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

An Autonomous Large Language Model Agent for Chemical Literature Data Mining

Structured information extraction from scientific text with large language models

A Review on Scientific Knowledge Extraction using Large Language Models in Biomedical Sciences

SciLit: A Platform for Joint Scientific Literature Discovery, Summarization and Citation Generation

Automated, LLM enabled extraction of synthesis details for reticular materials from scientific literature

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

SciScribe: Automating and contextualizing literature reviews in cardiac surgery

DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications

Interactive Distillation of Large Single-Topic Corpora of Scientific Papers

Validation of the Scientific Literature via Chemputation Augmented by Large Language Models