AutoIE: An Automated Framework for Information Extraction from Scientific Literature

Yangyang Liu, Shoubin Li
2024-01-30
Abstract: In the rapidly evolving field of scientific research, efficiently extracting key information from the burgeoning volume of scientific papers remains a formidable challenge. This paper introduces an innovative framework designed to automate the extraction of vital data from scientific PDF documents, enabling researchers to discern future research trajectories more readily. AutoIE uniquely integrates four novel components: (1) a multi-semantic feature fusion-based approach for PDF document layout analysis; (2) advanced functional block recognition in scientific texts; (3) a synergistic technique for extracting and correlating information on molecular sieve synthesis; (4) an online learning paradigm tailored for molecular sieve literature. Our SBERT model achieves high Macro F1 scores of 87.19 and 89.65 on the CoNLL04 and ADE datasets, respectively. In addition, a practical application of AutoIE in the petrochemical molecular sieve synthesis domain demonstrates its efficacy, evidenced by a 78% accuracy rate. This research paves the way for enhanced data management and interpretation in molecular sieve synthesis. It is a valuable asset for both seasoned experts and newcomers in this specialized field.
Information Retrieval; Artificial Intelligence; Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
The paper addresses the challenge of efficiently and accurately extracting key information from scientific literature. As the number of published papers grows, traditional information extraction methods struggle with the scale and complexity of the data, and their efficiency and accuracy decline. The paper therefore proposes an automated framework, AutoIE, built from four components:

1. **PDF Document Layout Analysis with Multi-Semantic Feature Fusion**: Locates where information sits within a document.
2. **Advanced Functional Block Recognition**: Identifies the key sections of a scientific text.
3. **Collaborative Extraction and Association of Molecular Sieve Synthesis Information**: Targets information extraction specifically in the field of molecular sieve synthesis.
4. **Online Learning Paradigm for Molecular Sieve Literature**: Addresses domain-specific problems such as corpus scarcity, limited expert availability, and sample annotation.

Together, these components are intended to extract key information from scientific literature quickly and accurately, helping researchers identify future research directions. The paper also applies AutoIE to molecular sieve synthesis literature, where it reports 78% accuracy.
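The abstract reports Macro F1 scores without saying how the metric is computed. As a rough illustration only, the sketch below computes macro-averaged F1 as the unweighted mean of per-class F1 scores. The function name `macro_f1` and the label sequences are hypothetical, and the sketch treats the task as per-item classification; the paper's actual evaluation protocol (for example, matching extracted entity and relation tuples) is not described in this summary and may differ.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical relation labels, for illustration only (not data from the paper).
gold = ["Work_For", "Live_In", "Live_In", "Kill"]
pred = ["Work_For", "Live_In", "Kill", "Kill"]
score = macro_f1(gold, pred)
print(f"Macro F1: {score:.4f}")
```

Because every class contributes equally to the mean, a rare class with poor F1 lowers the score as much as a frequent one, which is why macro averaging is often preferred when label distributions are skewed.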