Research on Recognition Method of Unknown Chinese Words Based on Statistic and Regulation

ZHOU Lei, ZHU Qiaoming
DOI: https://doi.org/10.3969/j.issn.1000-3428.2007.08.069
2007-01-01
Abstract: This paper presents a method for extracting unknown Chinese words that combines statistics and rules. The process has two parts: (1) The full text is segmented, and adjacent single Chinese characters are combined into short strings (fragments). Each fragment is then divided into candidate strings by full segmentation, and each string is assigned a weight based on rules and frequency. A greedy algorithm finds the longest path through each fragment; every string on this path other than a single character is taken as an unknown word. (2) A bi-gram model is built, and mutual information is used to combine some adjacent words into unknown words. On the open test sets, precision is 81.25% and recall is 82.38%.
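The two parts described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting rules, the greedy path search, or any thresholds, so only fragment building and the mutual-information merge are shown. The function names (`collect_fragments`, `pmi_scores`, `merge_candidates`), the pointwise form of mutual information, the choice to ignore isolated single characters, and the `threshold` and `min_count` defaults are all assumptions made for illustration.

```python
import math
from collections import Counter


def collect_fragments(tokens):
    """Join runs of adjacent single-character tokens into fragments.

    Assumption: a lone single character (a run of length 1) is not a fragment.
    """
    fragments, run = [], []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)
            continue
        if len(run) > 1:
            fragments.append("".join(run))
        run = []
    if len(run) > 1:
        fragments.append("".join(run))
    return fragments


def pmi_scores(sentences):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_candidates(sentences, threshold=1.0, min_count=2):
    """Propose merged words for pairs that are frequent and strongly associated.

    threshold and min_count are hypothetical values, not taken from the paper.
    """
    bigrams = Counter()
    for words in sentences:
        bigrams.update(zip(words, words[1:]))
    scores = pmi_scores(sentences)
    return sorted(
        a + b
        for (a, b), score in scores.items()
        if score >= threshold and bigrams[(a, b)] >= min_count
    )


if __name__ == "__main__":
    # Output of a dictionary-based segmenter (hypothetical example).
    print(collect_fragments(["我", "在", "中", "关", "村", "工作"]))
    print(merge_candidates([["a", "b"], ["a", "b"], ["c", "d"]]))
```

In a fuller pipeline, each fragment returned by `collect_fragments` would be passed to the full-segmentation and weighting step, and the merged candidates from `merge_candidates` would be added to the unknown words found there.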