Abstract: Over the past ten years, large amounts of original research data related to Earth system science have been made available at a rapidly increasing rate. This growing stock of data helps researchers understand the human-Earth system across different fields. A substantial amount of this data is published by geoscientists as open access in authoritative journals. If the information stored in this literature is properly extracted, there is significant potential to build a domain knowledge base. However, this potential remains largely unfulfilled in geoscience, and one of the biggest obstacles is the lack of publicly available corpora and baselines. To fill this gap, the Earth Science Data Corpus (ESDC), an academic text corpus of 600 abstracts, was built from the international journal Earth System Science Data (ESSD). To the best of our knowledge, ESDC is the first corpus detailed enough to serve as a professional training dataset for knowledge extraction and for constructing domain-specific knowledge graphs from massive amounts of literature. The production process of ESDC incorporates both the contextual features of spatiotemporal entities and the linguistic characteristics of academic literature. Furthermore, annotation guidelines and procedures tailored to Earth science data were formulated to ensure reliability. ChatGPT with zero- and few-shot prompting, the generative BARTNER model, and the discriminative W2NER model were trained on ESDC to evaluate performance on the named entity recognition task and showed increasing performance metrics, with BARTNER achieving the highest. Performance metrics for the individual entity types output by each model were also assessed. We used the trained BARTNER model to run inference on a larger unlabeled literature corpus, aiming to automatically extract a broader and richer set of entity information.
Subsequently, the extracted entity information was mapped to and associated with the Earth science data knowledge graph. Building on this knowledge graph, this paper validates multiple downstream applications, including research hot-topic analysis, scientometric analysis, and knowledge-enhanced large language model question-answering systems. These applications demonstrate that ESDC can provide scientists from different disciplines with information on Earth science data, help them better understand and obtain data, and promote further exploration in their respective professional fields.

VOYAGE: A Large Collection of Vocabulary Usage in Open RDF Datasets

Evaluating the Quality of RDF Data Sets on Common Vocabularies in the Social, Behavioral, and Economic Sciences

VisImages: a Corpus of Visualizations in the Images of Visualization Publications

VisImages: A Corpus of Images from Visualization Publications

VisImages: A Large-scale, High-quality Image Corpus in Visualization Publications.

Relatedness Between Vocabularies on the Web of Data: A Taxonomy and an Empirical Study

OHDSI Standardized Vocabularies—a large-scale centralized reference ontology for international data harmonization

Constraints to Validate RDF Data Quality on Common Vocabularies in the Social, Behavioral, and Economic Sciences

What's In My Big Data?

Open-Vocabulary Category-Level Object Pose and Size Estimation

NJVR: The NanJing Vocabulary Repository

Data of the Study "A Method to Assess Spatio-Temporal Units for Specific Tasks Based on Explainable Artificial Intelligence"

MOOCCube: A Large-Scale Data Repository for NLP Applications in MOOCs

259067 Subject-Predicate-Object Triples Extracted from Scientific Documents Regarding Cardiovascular Research in China During 2000-2020

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas

Objaverse: A Universe of Annotated 3D Objects

Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML

ESDC: An open Earth science data corpus to support geoscientific literature information extraction

VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

RedPajama: an Open Dataset for Training Large Language Models