Abstract: Deep learning, particularly with transformer models, has recently provided step-change capabilities for natural language processing applications such as question answering, query-based summarization, and language translation in general-purpose contexts. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully quantitative, and automated analysis of large corpora of data and gain insights. One of the key transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). It is trained on a large amount of general-purpose text (e.g., Common Crawl). Using such a model for geoscience applications faces two main challenges. The first is the scarcity of geoscience-specific vocabulary in general-purpose text (e.g., daily language); the second is geoscience jargon, where words carry domain-specific meanings. For example, salt is most likely to mean table salt in daily language, but in the geosciences it refers to a subsurface entity. To alleviate these challenges, we retrained a pre-trained BERT model on our 20M internal geoscientific records. We refer to the retrained model as GeoBERT. We fine-tuned the GeoBERT model for a number of tasks, including geoscience question answering and query-based summarization. BERT models are very large; for example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, could incur substantial latency if the entire database were processed at every call of the model. To address this challenge, we developed a retriever-reader engine in which an embedding-based similarity search serves as a context retrieval step, narrowing the context for a given query before that context is processed with GeoBERT. We built a solution integrating the context-retrieval and GeoBERT models. Benchmarks show that the solution effectively helps geologists identify answers and context for given questions. The prototype also produces summaries at different levels of granularity for a given set of documents. We have also demonstrated that the domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.
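
The retriever-reader pattern described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the embedding model, the similarity index, or the reader interface, so everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a learned encoder, and `retrieve`, `answer`, and the `reader` callable are hypothetical names, not the authors' implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: sparse word-count vector (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, passages, k=2):
    """Retriever step: return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def answer(query, passages, reader, k=2):
    """Reader step: run the (expensive) reader only on the retrieved context."""
    context = retrieve(query, passages, k)
    return reader(query, context)


# Hypothetical mini-corpus, for illustration only.
passages = [
    "Salt domes can form structural traps for hydrocarbons.",
    "The sandstone reservoir shows porosity of about twenty percent.",
    "Seismic surveys image the subsurface using reflected waves.",
]

query = "Where do salt domes trap hydrocarbons?"
top = retrieve(query, passages, k=1)
```

In a real system, `embed` would presumably be replaced by a dense sentence encoder and `reader` by a fine-tuned question-answering model; the point of the sketch is only that the reader sees `k` passages rather than the whole database, which is what limits per-query latency.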

SciBERT: A Pretrained Language Model for Scientific Text

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Fine-Tuning Large Language Models for Scientific Text Classification: A Comparative Study

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Deep Pre-Training Transformers for Scientific Paper Representation

ClimateBert: A Pretrained Language Model for Climate-Related Text

MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling

Enriched BERT Embeddings for Scholarly Publication Classification

Towards understanding evolution of science through language model series

MatSciBERT: A materials domain language model for text mining and information extraction

MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education

CSDR-BERT: a pre-trained scientific dataset match model for Chinese Scientific Dataset Retrieval

FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining

bert2BERT: Towards Reusable Pretrained Language Models

Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT

Geoscience Language Processing for Exploration

SINA-BERT: A pre-trained Language Model for Analysis of Medical Texts in Persian

RoBERTa: A Robustly Optimized BERT Pretraining Approach