Cross-Domain Evaluation of POS Taggers: From Wall Street Journal to Fandom Wiki

Kia Kirstein Hansen, Rob van der Goot
2023-04-27
Abstract: The Wall Street Journal section of the Penn Treebank has been the de-facto standard for evaluating POS taggers for a long time, and accuracies over 97% have been reported. However, less is known about out-of-domain tagger performance, especially with fine-grained label sets. Using data from Elder Scrolls Fandom, a wiki about the Elder Scrolls video game universe, we create a modest dataset for qualitatively evaluating the cross-domain performance of two POS taggers: the Stanford tagger (Toutanova et al. 2003) and Bilty (Plank et al. 2016), both trained on WSJ. Our analyses show that performance on tokens seen during training is almost as good as in-domain performance, but accuracy on unknown tokens decreases from 90.37% to 78.37% (Stanford) and 87.84% to 80.41% (Bilty) across domains. Both taggers struggle with proper nouns and inconsistent capitalization.
Computation and Language
What problem does this paper attempt to address?
This paper addresses how well part-of-speech (POS) taggers perform on data from a domain other than the one they were trained on. Specifically, the researchers ask how taggers trained on a standard dataset, the Wall Street Journal (WSJ) section of the Penn Treebank, perform on text from another domain when a fine-grained tag set is used. To answer this, they create a new dataset from the Elder Scrolls Fandom (ESF) wiki and use it to evaluate two popular POS taggers, the Stanford tagger and Bilty. The study finds that both taggers perform well on tokens seen in the training set, but that accuracy on unseen tokens drops across domains: from 90.37% to 78.37% for the Stanford tagger and from 87.84% to 80.41% for Bilty. Distinguishing proper nouns from other parts of speech is a particular source of errors. The paper also examines how the taggers handle issues such as inconsistent capitalization and spelling errors, and analyzes common error types and their causes. In summary, the paper explores the limitations of existing POS taggers in cross-domain use and provides a reference point for future research and improvements.
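The split between tokens seen during training and unknown tokens can be illustrated with a minimal sketch. It assumes that a token counts as "known" when its exact, case-sensitive form appears in the training vocabulary; the paper's precise definition is not given here, and the function name `accuracy_by_vocab` and the toy sentence are hypothetical, not taken from the authors' code or dataset.

```python
from typing import Dict, Optional, Sequence, Set


def accuracy_by_vocab(
    train_vocab: Set[str],
    tokens: Sequence[str],
    gold_tags: Sequence[str],
    pred_tags: Sequence[str],
) -> Dict[str, Optional[float]]:
    """Return tagging accuracy separately for known and unknown tokens.

    Assumption: a token is "known" if its exact (case-sensitive) form
    occurs in the training vocabulary.
    """
    # For each bucket, store [number correct, number of tokens].
    counts = {"known": [0, 0], "unknown": [0, 0]}
    for token, gold, pred in zip(tokens, gold_tags, pred_tags):
        bucket = "known" if token in train_vocab else "unknown"
        counts[bucket][0] += int(gold == pred)
        counts[bucket][1] += 1
    # Report None for an empty bucket instead of dividing by zero.
    return {
        bucket: (correct / total if total else None)
        for bucket, (correct, total) in counts.items()
    }


# Toy data, invented for illustration only (Penn Treebank style tags).
train_vocab = {"the", "dragon", "attacks", "a"}
tokens = ["the", "Dovahkiin", "attacks", "a", "draugr"]
gold_tags = ["DT", "NNP", "VBZ", "DT", "NN"]
pred_tags = ["DT", "NN", "VBZ", "DT", "NN"]

result = accuracy_by_vocab(train_vocab, tokens, gold_tags, pred_tags)
print(result)
```

In this toy sentence the only error is an unknown proper noun tagged as a common noun, which mirrors the kind of mistake the abstract highlights. Because membership is checked case-sensitively, a capitalized variant of a training word would be counted as unknown; lowercasing both sides would be a different, equally plausible design choice.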