HistNERo: Historical Named Entity Recognition for the Romanian Language

Andrei-Marius Avram, Andreea Iuga, George-Vlad Manolache, Vlad-Cristian Matei, Răzvan-Gabriel Micliuş, Vlad-Andrei Muntean, Manuel-Petru Sorlescu, Dragoş-Andrei Şerban, Adrian-Dinu Urse, Vasile Păiş, Dumitru-Clementin Cercel
2024-05-01
Abstract: This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, covering more than half of the 19th century (i.e., 1817) until the late part of the 20th century (i.e., 1990). Eight native Romanian speakers annotated the dataset with five named entities. The samples belong to one of the following four historical regions of Romania, namely Bessarabia, Moldavia, Transylvania, and Wallachia. We employed this proposed dataset to perform several experiments for NER using Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. Also, by reducing the discrepancies between regions through a novel domain adaption technique, we improved the performance on this corpus to a strict F1-score of 66.80%, representing an absolute gain of more than 10%.
Computation and Language
What problem does this paper attempt to address?
The paper addresses Named Entity Recognition (NER) in Romanian historical newspapers. It introduces HistNERo, the first NER corpus for this domain. The dataset contains 323,000 tokens of historical text spanning the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated it with five types of named entities (i.e., person names, organizations, locations, products, and dates). The samples come from four historical regions of Romania: Bessarabia, Moldavia, Transylvania, and Wallachia.

The main contributions of the paper include:

1. **Creation of an open-license corpus**: It contains 323,000 tokens annotated with five types of named entities by eight native Romanian speakers.
2. **Evaluation of various Romanian pre-trained language models on this dataset**: Experimental results show that the best model, RoBERT-base, achieved a strict F1 score of 55.69%.
3. **Proposal of a new domain adaptation technique (loss inversion)**: By reducing the discrepancies between regions, this method improves the performance of the best model to a strict F1 score of 66.80%, an absolute gain of 11.11 percentage points.

Overall, the paper aims to improve named entity recognition in Romanian historical texts by constructing and evaluating the HistNERo dataset, with particular attention to the differences between texts from different historical regions.
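
The "strict" F1 score reported above is commonly understood as an entity-level metric in which a predicted entity counts as correct only if both its boundaries and its type match the gold annotation exactly. The following is a minimal sketch of that definition, not the paper's evaluation script. It assumes BIO-style tags, and the label names (`PER`, `LOC`, `ORG`) and helper names (`extract_spans`, `strict_f1`) are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag != "O":
            # "B-X" opens a span; a stray "I-X" is treated as opening one too.
            start, label = i, tag[2:]
    return spans


def strict_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Entity-level F1 where only exact boundary and type matches count."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative tags only: the second entity has the right span but wrong type.
gold_example = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred_example = ["B-PER", "I-PER", "O", "B-ORG", "O"]
example_score = strict_f1(gold_example, pred_example)
```

In this toy example one of the two predicted entities matches exactly, so precision and recall are both 0.5. A prediction that gets the type right but cuts an entity short would likewise receive no credit, which is what distinguishes strict matching from more lenient partial-overlap scoring.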