WikiNER-fr-gold: A Gold-Standard NER Corpus

Danrun Cao, Nicolas Béchet, Pierre-François Marteau

2024-10-29

Abstract: In this article we address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner, i.e., no manual verification was carried out a posteriori. Such a corpus is called silver-standard. In this paper we propose WikiNER-fr-gold, a revised version of the French portion of WikiNER. Our corpus consists of a random 20% sample of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally, we present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus, and we discuss potential directions for future work.

Computation and Language, Artificial Intelligence, Databases

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the issue of insufficient annotation quality in the WikiNER corpus. Specifically:

1. **Limitations of Semi-Supervised Annotation**: WikiNER is a multilingual named entity recognition (NER) corpus, and its annotations are generated through a semi-supervised method without human verification. Therefore, it is referred to as a silver-standard corpus. This annotation method may lead to errors and inconsistencies.
2. **Specific Needs for French Texts**: Although WikiNER includes French texts, existing French NER corpora are either limited in quantity or require extensive correction work. Therefore, there is a need for a high-quality, manually corrected French NER corpus.

### Solution

To improve the quality of the French part of WikiNER, the authors propose WikiNER-fr-gold, a revised French sub-corpus. The specific steps are as follows:

1. **Random Sampling**: Randomly sample 20% of the data from the original WikiNER French sub-corpus, including 26,818 sentences and approximately 700,000 tokens.
2. **Define Annotation Guidelines**: Summarize the entity types for each category to ensure consistency and accuracy in annotation.
3. **Manual Correction**: Manually correct the sampled data to fix errors and inconsistencies.
4. **Error Analysis**: Analyze the errors in the original WikiNER-fr corpus and discuss future directions for improvement.

### Main Contributions

1. **High-Quality French NER Corpus**: WikiNER-fr-gold provides a high-quality, manually corrected French NER corpus that can be used for training and evaluating NER systems.
2. **Detailed Error Analysis**: The paper provides a detailed analysis of common error types in the original corpus and offers specific correction strategies.
3. **Future Work Directions**: Discusses the potential for further improvement and expansion of WikiNER-fr-gold, including automated correction and cross-lingual expansion.

### Conclusion

By creating WikiNER-fr-gold, the authors provide a high-quality French NER corpus that helps improve the performance of French NER tasks. Future work will include more comprehensive evaluation and further corrections to cover more data and languages.
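
The random-sampling and error-analysis steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes each corpus line holds one sentence whose tokens are written as `word|POS|tag`, and the function names `parse_sentence`, `sample_sentences`, and `tag_disagreements` are hypothetical helpers introduced only for this example.

```python
import random
from collections import Counter


def parse_sentence(line):
    """Split one sentence line into (word, tag) pairs.

    Assumption: tokens are space-separated and look like 'word|POS|tag'.
    """
    pairs = []
    for token in line.split():
        # rsplit keeps any '|' that occurs inside the word itself
        word, _pos, tag = token.rsplit("|", 2)
        pairs.append((word, tag))
    return pairs


def sample_sentences(sentences, fraction=0.2, seed=0):
    """Return a reproducible random subset, kept in original order."""
    rng = random.Random(seed)
    k = round(len(sentences) * fraction)
    indices = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in indices]


def tag_disagreements(silver, gold):
    """Count (silver_tag, gold_tag) pairs where the two annotations differ.

    Assumption: both corpora contain the same sentences with the same
    tokenization, so tokens can be compared position by position.
    """
    counts = Counter()
    for silver_sent, gold_sent in zip(silver, gold):
        for (_, silver_tag), (_, gold_tag) in zip(silver_sent, gold_sent):
            if silver_tag != gold_tag:
                counts[(silver_tag, gold_tag)] += 1
    return counts
```

Fixing the seed makes the 20% sample reproducible, and the disagreement counter gives a rough view of which silver labels were most often changed during manual correction.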

Automatically Building Large-Scale Named Entity Recognition Corpora from Chinese Wikipedia

WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

MSNER: A Multilingual Speech Dataset for Named Entity Recognition

MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy Named Entity Recognition

EduNER: a Chinese Named Entity Recognition Dataset for Education Research

E-NER -- An Annotated Named Entity Recognition Corpus of Legal Text

A Corpus for Named Entity Recognition in Chinese Novels with Multi-genres

Comparative Analysis of Extrinsic Factors for NER in French

GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets

MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition

Establishing a New State-of-the-Art for French Named Entity Recognition

NanoNER: Named Entity Recognition for nanobiology using experts' knowledge and distant supervision

Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data

Annotation Guidelines for Corpus Novelties: Part 1 -- Named Entity Recognition

Terminologies augmented recurrent neural network model for clinical named entity recognition

Annotation Errors and NER: A Study with OntoNotes 5.0

GenNER - A highly scalable and optimal NER method for text-based gene and protein recognition