An Improved Unknown Word Recognition Model Based on Multi-Knowledge Source Method.
Wei Jiang, Yi Guan, Xiao-Long Wang
DOI: https://doi.org/10.1109/isda.2006.253719
2006-01-01
Abstract: Unknown word recognition (UWR) is a difficult and foundational task in lexical processing and content-based understanding. It can improve many text-based processing applications, such as information extraction, question answering systems, and electronic meeting systems. However, a unified approach has difficulty exploiting additional domain knowledge features, so its performance cannot easily be improved further, since UWR has been proved to be an NP-hard problem. This paper presents a novel method for the UWR task that divides UWR into several hard sub-tasks, each of which typically poses different difficulties. Accordingly, several language models are adopted to solve the individual sub-tasks, so that each model is applied to the problems it handles best. Firstly, a class-based trigram model is used for basic word segmentation, aided by an absolute smoothing algorithm to overcome data sparseness. A Maximum Entropy (ME) model is used to recognize named entities, and new word detection uses variance together with a Conditional Random Fields algorithm. Secondly, multi-knowledge features are extracted and used throughout the whole process. Our system participated in the Second International Chinese Word Segmentation Bakeoff (SIGHAN2005) and achieved an overall F-measure of 97.2% in the MSRA open test.