Sm-Nd Isotope Data Compilation from Geoscientific Literature Using an Automated Tabular Extraction Method

Zhixin Guo, Tao Wang, Chaoyang Wang, Jianping Zhou, Guanjie Zheng, Xinbing Wang, Chenghu Zhou
2024-03-27
Abstract: The rare earth elements Sm and Nd play a significant role in addressing fundamental questions about crustal growth, such as its spatiotemporal evolution and the interplay between orogenesis and crustal accretion. Their relative immobility during high-grade metamorphism makes the Sm-Nd isotopic system crucial for inferring crustal formation times. Historically, data have been disseminated sporadically in the scientific literature due to complicated and costly sampling procedures, resulting in a fragmented knowledge base. Compiling critical geoscience data scattered across many publications therefore demands substantial human effort and time. In response, we present an automated tabular extraction method for harvesting tabular geoscience data. Using this method, we collected 10,624 Sm-Nd data entries from 9,138 tables in over 20,000 geoscience publications. We manually selected 2,118 of these data points to supplement our previously constructed global Sm-Nd dataset, increasing its sample count by over 20%. Our automatic data collection methodology enhances the efficiency of data acquisition across various scientific domains. Furthermore, the constructed Sm-Nd isotopic dataset should motivate research on classifying global orogenic belts.
Databases
What problem does this paper attempt to address?
The problem this paper addresses is how to efficiently and automatically extract and integrate Sm-Nd isotope data from a large body of geoscience literature. Specifically, the authors propose an automated table extraction method to address the following challenges:

1. **Data dispersion**: Sm-Nd isotope data are scattered across a large number of publications, which fragments the knowledge base and makes the data difficult to obtain and analyze systematically.
2. **Inefficiency of manual collection**: Traditionally, these data have been collected and classified mainly by hand. Given the volume of data and the complexity of its structure, this approach is inefficient and error-prone.
3. **Data processing in multidisciplinary fields**: Geoscience data are multidimensional and complex, involving space, time, and chemical composition, and require efficient processing methods.

By proposing an automated table extraction method, the authors aim to improve the efficiency of collecting Sm-Nd isotope data and to provide more comprehensive data support for research on global orogenic belts. This helps in understanding the spatiotemporal evolution of the crust and the relationship between orogeny and crustal accretion, and it supports inferring the formation time of the continental crust.

### Specific problems:

- **How to efficiently extract and integrate Sm-Nd isotope data**: The paper proposes an automated table extraction tool that can quickly extract relevant data from a large body of literature.
- **How to deal with data dispersion and the inefficiency of manual collection**: The automated method significantly improves the speed and accuracy of data collection.
- **How to process data in multidisciplinary fields**: The method can handle complex table structures and ensure the integrity and consistency of the data.
### Solutions:

The authors developed a two-stage workflow consisting of document retrieval and table data collection, followed by processing of the extracted data. The steps are as follows:

1. **Document retrieval**: Use tools such as CERMINE to extract the metadata of PDF documents, and refine the selection of target publications through keyword queries.
2. **Table data collection**: Identify and extract the data in tables using computer vision and OCR, and integrate the results into an Excel spreadsheet.
3. **Data processing**: Locate, expand, and standardize the extracted data and integrate its metadata to ensure the data are complete and usable.

Using this method, the authors collected 10,624 Sm-Nd isotope data entries, significantly increasing the sample size of the global Sm-Nd dataset and providing a solid foundation for subsequent research.
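The keyword-based document selection and the standardization step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword list, the canonical column names, and every function name below are hypothetical. The sketch assumes that metadata (title, abstract) has already been extracted from the PDFs and that tables have already been recovered as lists of strings, so neither CERMINE nor the computer vision and OCR stage is invoked here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical query terms; the paper's actual keyword list is not given here.
KEYWORDS = ("sm-nd", "143nd/144nd", "147sm/144nd", "neodymium isotope")


@dataclass
class PaperMeta:
    """Metadata assumed to have been extracted from a PDF beforehand."""
    title: str
    abstract: str


def is_candidate(meta: PaperMeta, keywords=KEYWORDS) -> bool:
    """Return True if any keyword occurs in the title or abstract."""
    text = f"{meta.title} {meta.abstract}".lower()
    return any(keyword in text for keyword in keywords)


# Hypothetical mapping from raw header variants to canonical column names.
# Isotope ratios are checked first so they are not mistaken for element columns.
HEADER_PATTERNS = [
    (re.compile(r"147\s*sm\s*/\s*144\s*nd", re.I), "147Sm/144Nd"),
    (re.compile(r"143\s*nd\s*/\s*144\s*nd", re.I), "143Nd/144Nd"),
    (re.compile(r"^sm\b", re.I), "Sm_ppm"),
    (re.compile(r"^nd\b", re.I), "Nd_ppm"),
    (re.compile(r"^sample", re.I), "sample"),
]


def normalize_header(raw: str) -> Optional[str]:
    """Map one raw column header to a canonical name, or None if unrecognized."""
    cleaned = raw.strip()
    for pattern, canonical in HEADER_PATTERNS:
        if pattern.search(cleaned):
            return canonical
    return None


def standardize_table(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Keep only recognized columns and key each cell by its canonical name."""
    mapping = {}
    for index, raw in enumerate(header):
        name = normalize_header(raw)
        if name is not None:
            mapping[index] = name
    return [
        {name: row[index] for index, name in mapping.items() if index < len(row)}
        for row in rows
    ]
```

For example, `standardize_table(["Sample no.", "Sm (ppm)", "Rb", "143Nd/144Nd"], [["A1", "4.2", "10", "0.512638"]])` keeps the sample, Sm, and isotope-ratio columns and drops the unrecognized `Rb` column. A real pipeline would need far more header variants and unit handling than this short pattern list covers.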