WikiNER-fr-gold: A Gold-Standard NER Corpus

Danrun Cao, Nicolas Béchet, Pierre-François Marteau
2024-10-29
Abstract: In this article we address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner, i.e. no manual verification has been carried out a posteriori. Such a corpus is called silver-standard. In this paper we propose WikiNER-fr-gold, a revised version of the French portion of WikiNER. Our corpus consists of a random sample of 20% of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally, we present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus, and we discuss potential directions for future work.
Computation and Language, Artificial Intelligence, Databases
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the issue of insufficient annotation quality in the WikiNER corpus. Specifically:

1. **Limitations of Semi-Supervised Annotation**: WikiNER is a multilingual named entity recognition (NER) corpus whose annotations are generated through a semi-supervised method without human verification. It is therefore referred to as a silver-standard corpus. This annotation method may lead to errors and inconsistencies.
2. **Specific Needs for French Texts**: Although WikiNER includes French texts, existing French NER corpora are either limited in quantity or require extensive correction work. There is therefore a need for a high-quality, manually corrected French NER corpus.

### Solution

To improve the quality of the French part of WikiNER, the authors propose WikiNER-fr-gold, a revised French sub-corpus. The specific steps are as follows:

1. **Random Sampling**: Randomly sample 20% of the data from the original WikiNER French sub-corpus, amounting to 26,818 sentences and approximately 700,000 tokens.
2. **Define Annotation Guidelines**: Summarize the entity types for each category to ensure consistency and accuracy in annotation.
3. **Manual Correction**: Manually correct the sampled data to fix errors and inconsistencies.
4. **Error Analysis**: Analyze the errors in the original WikiNER-fr corpus and discuss future directions for improvement.

### Main Contributions

1. **High-Quality French NER Corpus**: WikiNER-fr-gold provides a high-quality, manually corrected French NER corpus that can be used for training and evaluating NER systems.
2. **Detailed Error Analysis**: The paper provides a detailed analysis of common error types in the original corpus and offers specific correction strategies.
3. **Future Work Directions**: The paper discusses the potential for further improvement and expansion of WikiNER-fr-gold, including automated correction and cross-lingual expansion.
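
The random sampling step listed under Solution can be illustrated with a short Python sketch. It is not the authors' code: the function name `sample_sentences`, the fixed seed, and the choice to preserve the original sentence order are assumptions made here for illustration only.

```python
import random


def sample_sentences(sentences, fraction=0.2, seed=0):
    """Return a random subset holding `fraction` of the sentences.

    Hypothetical helper: the seed and the order-preserving behaviour are
    illustrative assumptions, not details taken from the paper.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    k = round(len(sentences) * fraction)
    rng = random.Random(seed)  # local generator, so the sample is reproducible
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    toy_corpus = [f"sentence {i}" for i in range(100)]
    subset = sample_sentences(toy_corpus)
    print(len(subset))  # 20 sentences out of 100
```

Sampling at the sentence level with a fixed seed keeps the subset reproducible, which matters when a manually revised sample has to be aligned later with the original silver-standard annotations.
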
### Conclusion

By creating WikiNER-fr-gold, the authors provide a high-quality French NER corpus that helps improve the performance of French NER tasks. Future work will include more comprehensive evaluation and further corrections to cover more data and languages.
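
As a rough illustration of how an error analysis between a silver-standard and a revised annotation could be tallied, the Python sketch below counts the tokens whose label changed. The function name `count_label_changes` and the label strings are hypothetical, and the paper's actual analysis procedure may differ.

```python
from collections import Counter


def count_label_changes(silver_tags, gold_tags):
    """Count (silver, gold) label pairs for tokens whose label was revised.

    Hypothetical helper: both inputs are flat lists of per-token labels
    aligned one to one.
    """
    if len(silver_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    return Counter(
        (silver, gold)
        for silver, gold in zip(silver_tags, gold_tags)
        if silver != gold
    )


if __name__ == "__main__":
    # Illustrative labels only, not taken from the corpus.
    silver = ["LOC", "O", "PER", "ORG", "LOC"]
    gold = ["ORG", "O", "PER", "ORG", "ORG"]
    for (old, new), n in count_label_changes(silver, gold).most_common():
        print(f"{old} -> {new}: {n}")
```

Grouping the revisions by (old label, new label) pair makes systematic confusions stand out, which is the kind of pattern an annotation guideline is meant to resolve.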