Abstract: The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection ($\sim 658$ million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on an NER task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and for combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF). We find that ArcheoBERTje significantly outperforms both the multilingual and the Dutch model, with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions with explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide insights into the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.
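The abstract does not say which ensemble methods were tried. As one illustration of what combining three BERT models can look like, the sketch below applies a simple token-level majority vote to the label sequences of three taggers. The function name `majority_vote`, the tie-breaking rule, and the labels `B-ART`, `I-ART`, `B-PER`, and `B-LOC` are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token labels from several models by majority vote.

    predictions[m][t] is the label that model m assigns to token t.
    On a tie, the label of the model at index `fallback` wins if it is
    among the tied labels; otherwise the first tied label in model order wins.
    (Hypothetical helper, not the method used in the paper.)
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    combined: List[str] = []
    for t in range(n_tokens):
        labels_at_t = [p[t] for p in predictions]
        votes = Counter(labels_at_t)
        top_count = max(votes.values())
        tied = {label for label, count in votes.items() if count == top_count}
        if labels_at_t[fallback] in tied:
            combined.append(labels_at_t[fallback])
        else:
            # Pick the first tied label in model order for determinism.
            combined.append(next(l for l in labels_at_t if l in tied))
    return combined


# Made-up predictions for five tokens from three hypothetical models.
domain_model = ["O", "B-ART", "I-ART", "O", "B-PER"]
dutch_model = ["O", "B-ART", "O", "O", "B-LOC"]
multilingual_model = ["O", "O", "O", "O", "O"]

ensemble = majority_vote([domain_model, dutch_model, multilingual_model])
print(ensemble)
```

A token-level vote like this can produce label sequences that are not valid under a BIO scheme (for example an `I-` tag without a preceding `B-` tag), so a real system would likely need a repair step or a sequence model on top of the votes.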

RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

Language Resources for Dutch Large Language Modelling

GottBERT: a pure German Language Model

CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

DUMB: A Benchmark for Smart Evaluation of Dutch Models

RoBERTa: A Robustly Optimized BERT Pretraining Approach

FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers

A Retrieval-Augmented Generation Based Large Language Model Benchmarked On a Novel Dataset

RoBERTuito: a pre-trained language model for social media text in Spanish

GEITje 7B Ultra: A Conversational Model for Dutch

Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model

Interpreting Language Models Through Knowledge Graph Extraction

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

KR-BERT: A Small-Scale Korean-Specific Language Model

Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

Optimizing small BERTs trained for German NER

Re-Evaluating GermEval17 Using German Pre-Trained Language Models

WangchanBERTa: Pretraining transformer-based Thai Language Models

ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding

Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain

WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models