Unknown Word Detection in Song Poetry.

Xia Li, Bin Wu, Bailing Zhang
DOI: https://doi.org/10.1109/dsc.2016.47
2016-01-01
Abstract: Mining an ancient Chinese corpus is less convenient than mining modern Chinese text, because tokenizers perform poorly on it for lack of a complete dictionary of ancient Chinese words. Finding an effective way to identify the words in these texts is therefore important. In this paper, we treat unknown word detection as a binary classification task and propose new features for classification: word embeddings learned with word2vec, activation distance between characters, and statistical transform probability learned from a modern Chinese corpus. These three features improve classification precision and recall. A modified SVM algorithm is then proposed to detect all unknown words from candidate character sequences. With the newly found words added to the dictionary, the tokenizer's precision increases greatly.
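The abstract does not define its features, so the following is only a minimal sketch of how candidate character sequences and one corpus-derived statistic might be computed. It assumes that the "statistical transform probability" resembles a character-to-character transition probability estimated from adjacent characters. The function names, the length limit, and the toy corpus are hypothetical, and the embedding features and the modified SVM are not reproduced here.

```python
from collections import Counter


def candidate_sequences(line, max_len=3):
    """Enumerate every contiguous character sequence of length 2..max_len."""
    return [line[i:i + n]
            for n in range(2, max_len + 1)
            for i in range(len(line) - n + 1)]


def transition_probabilities(corpus_lines):
    """Estimate P(next char | current char) from adjacent character pairs.

    Assumption: this stands in for the abstract's "statistical transform
    probability"; the paper's exact definition may differ.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for text in corpus_lines:
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: count / first_counts[pair[0]]
            for pair, count in pair_counts.items()}


def min_transition(candidate, probs):
    """Weakest internal transition of a candidate of length >= 2.

    Character pairs never seen in the corpus score 0.0.
    """
    return min(probs.get((a, b), 0.0)
               for a, b in zip(candidate, candidate[1:]))


# Toy corpus, for illustration only.
corpus = ["明月幾時有", "明月松間照"]
probs = transition_probabilities(corpus)
features = {c: min_transition(c, probs) for c in candidate_sequences("明月松")}
```

In a setup like the one the abstract describes, a value such as `min_transition` would be one entry in each candidate's feature vector, and a binary classifier would then decide whether the candidate is a word.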