Metadata Might Make Language Models Better

Kaspar Beelen, Daniel van Strien
DOI: https://doi.org/10.48550/arXiv.2211.10086
2022-11-18
Abstract: This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political, and geographical information into a Masked Language Model. After fine-tuning several DistilBERT models on enhanced input data, we systematically evaluate them on three tasks: pseudo-perplexity, metadata mask-filling, and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
Computation and Language, Digital Libraries