SciDaSynth: Interactive Structured Knowledge Extraction and Synthesis from Scientific Literature with Large Language Model

Xingbo Wang, Samantha L. Huey, Rui Sheng, Saurabh Mehta, Fei Wang
2024-04-22
Abstract: Extraction and synthesis of structured knowledge from extensive scientific literature are crucial for advancing and disseminating scientific progress. Although many existing systems facilitate literature review and digest, they struggle to process multimodal, varied, and inconsistent information within and across the literature into structured data. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that enables researchers to efficiently build structured knowledge bases from scientific literature at scale. The system automatically creates data tables that organize and summarize the knowledge users are interested in, drawing it from the literature via question-answering. Furthermore, it provides multi-level and multi-faceted exploration of the generated data tables, facilitating iterative validation, correction, and refinement. Our within-subjects study with researchers demonstrates the effectiveness and efficiency of SciDaSynth in constructing quality scientific knowledge bases. We further discuss the design implications for human-AI interaction tools for data extraction and structuring.
Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses the difficulty of turning unstructured knowledge in scientific literature into structured data. Existing literature-review tools struggle with information that is multimodal, varied, and inconsistent within and across papers. The paper presents SciDaSynth, an interactive system that uses large language models (LLMs) to extract and synthesize structured knowledge from large collections of papers. The system automatically builds data tables through question-answering to organize and summarize the knowledge users are interested in. It then supports multi-level and multi-faceted exploration of the generated tables so that users can iteratively validate, correct, and refine them. This keeps researchers in the loop to supervise the LLM output and improve the accuracy and reliability of the extracted knowledge. A within-subjects study with researchers demonstrated the effectiveness and efficiency of SciDaSynth in building quality scientific knowledge bases. The paper also discusses design implications for human-AI interaction tools for data extraction and structuring.
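The workflow summarized above (ask a question per field, collect the answers into a table, then let a person validate and correct it) can be illustrated with a minimal Python sketch. This is not SciDaSynth's implementation: every name here (`Row`, `build_table`, `toy_answerer`, `apply_correction`) is hypothetical, the paper texts and field names are made up, and `toy_answerer` is a plain string matcher standing in for where an LLM call would go.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical type: an answerer maps (paper_text, field_name) to a value,
# or None when nothing is found. An LLM call would play this role in practice.
Answerer = Callable[[str, str], Optional[str]]


@dataclass
class Row:
    paper_id: str
    values: Dict[str, Optional[str]]
    needs_review: bool  # True when at least one field is missing


def build_table(papers: Dict[str, str], fields: List[str],
                answer: Answerer) -> List[Row]:
    """Ask one question per field for each paper; return one row per paper."""
    rows = []
    for paper_id, text in papers.items():
        values = {field: answer(text, field) for field in fields}
        missing = any(v is None for v in values.values())
        rows.append(Row(paper_id, values, missing))
    return rows


def toy_answerer(text: str, field: str) -> Optional[str]:
    """Stand-in for an LLM: finds a 'field: value' line in the text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == field.lower():
            return value.strip()
    return None


def apply_correction(rows: List[Row], paper_id: str,
                     field: str, value: str) -> None:
    """Record a human correction and refresh the row's review flag."""
    for row in rows:
        if row.paper_id == paper_id:
            row.values[field] = value
            row.needs_review = any(v is None for v in row.values.values())


# Made-up example data, for illustration only.
papers = {"p1": "crop: maize\nnutrient: zinc", "p2": "crop: rice"}
rows = build_table(papers, ["crop", "nutrient"], toy_answerer)
apply_correction(rows, "p2", "nutrient", "iron")
```

In this sketch, rows with missing values are flagged so a reviewer can find them quickly, and a correction clears the flag once every field is filled. It only mirrors the validate-and-correct loop described in the summary and says nothing about how the actual system stores or presents its tables.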