Abstract: Chinese word segmentation (CWS) is the foundation of geological report text mining and strongly influences downstream tasks such as named entity recognition and relation extraction. In recent years, the accuracy of domain‐general CWS models has been limited by the domain coverage and scale of their training corpora, and data on Chinese geological texts are especially scarce. Training these CWS models also requires large amounts of manually annotated data, which costs considerable time and effort. When existing models and methods are applied directly to the geoscience domain, segmentation accuracy drops dramatically. To address this problem, we pretrain Bidirectional Encoder Representations from Transformers (BERT), which can leverage unlabeled domain‐specific knowledge, on unlabeled Chinese geological text and then feed its output into a bidirectional long short‐term memory and conditional random field (BiLSTM‐CRF) model to extract text features. Finally, the CRF layer decodes the predicted tags. The experimental results show that the proposed model reaches an F1 score of 96.2% on the constructed test set of geological texts. Additional experiments show that the proposed model performs comparably to other state‐of‐the‐art models and that the proposed cyclic self‐learning strategy can be extended to other domains.

Supervised word segmentation models commonly lack specialized knowledge in their training data and therefore adapt poorly to new domains. This study proposes a sequence labeling model for geoscience text that automatically constructs a domain training corpus and performs word segmentation while taking into account long‐distance dependencies within sentences. We hope that our approach will serve as an alternative method that deserves further study.
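The final decoding step described in the abstract can be illustrated with a small, self-contained sketch. It assumes the common BMES tagging scheme (Begin, Middle, End, Single) and uses hand-written emission scores in place of real BiLSTM outputs; the tag set, the hard legality constraints, the scores, and the function names `viterbi` and `tags_to_words` are all illustrative assumptions, not details taken from the paper. A trained CRF would learn real-valued transition scores rather than the 0/-inf constraints used here.

```python
TAGS = ["B", "M", "E", "S"]  # Begin, Middle, End, Single-character word
NEG_INF = float("-inf")

# Assumed hard constraints: only these tag bigrams form valid segmentations.
LEGAL = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
         ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}


def viterbi(emissions):
    """Return the highest-scoring legal BMES tag sequence.

    emissions: one dict per character, mapping tag -> score.
    """
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            # Best previous tag among those allowed to precede `cur`.
            prev = max(TAGS, key=lambda p: score[p] if (p, cur) in LEGAL else NEG_INF)
            legal = (prev, cur) in LEGAL
            new_score[cur] = score[prev] + emit[cur] if legal else NEG_INF
            pointers[cur] = prev
        backpointers.append(pointers)
        score = new_score
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Group characters into words, closing a word at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a dangling B/M at the end of the input
        words.append(current)
    return words


# Made-up emission scores for the five characters of "花岗岩侵入".
emissions = [
    {"B": 2.0, "M": 0.1, "E": 0.1, "S": 0.5},
    {"B": 0.2, "M": 1.5, "E": 1.0, "S": 0.1},
    {"B": 0.3, "M": 0.4, "E": 2.0, "S": 0.2},
    {"B": 1.0, "M": 0.2, "E": 0.3, "S": 1.2},  # per-character argmax would pick S here
    {"B": 0.1, "M": 0.1, "E": 2.0, "S": 0.5},
]
tags = viterbi(emissions)
words = tags_to_words("花岗岩侵入", tags)
print(tags, words)
```

In this toy input, choosing the best tag for each character independently would yield the illegal bigram S followed by E at the fourth and fifth characters. Decoding over the whole sequence avoids that, which is the usual motivation for placing a CRF on top of the BiLSTM.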
BERT is used to capture the abundant word‐level, grammatical structure, and semantic features in sentences.
The self‐learning strategy assisted by domain knowledge can automatically construct the domain training corpus without manual intervention.
A set of experiments verifies the effectiveness of the proposed method on an available, manually constructed hybrid data set.
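For readers unfamiliar with how segmentation quality is usually scored, the following minimal sketch computes word‐level precision, recall, and F1 by comparing character spans of gold and predicted words. The abstract does not state the exact scoring protocol, so this conventional span‐matching definition is an assumption, and the helper names `word_spans` and `segmentation_f1` as well as the example sentences are hypothetical.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold_sentences, pred_sentences):
    """Return (precision, recall, f1) over all sentences.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative data: the prediction wrongly splits "花岗岩" into two words.
gold = [["花岗岩", "侵入", "体"]]
pred = [["花岗", "岩", "侵入", "体"]]
precision, recall, f1 = segmentation_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here two of the four predicted words match the three gold words, so precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.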

Chinese Word Segmentation Based on Self‐Learning Model and Geological Knowledge for the Geoscience Domain
