Novel Chinese Text Format Based on Word Encoding

焦慧,刘迁,贾惠波
DOI: https://doi.org/10.3969/j.issn.1002-137x.2008.10.041
2008-01-01
Computer Science
Abstract: This paper analyzes the main reasons why automatic Chinese word segmentation is needed and the difficulties that arise in the process. It presents a novel Chinese text encoding method and a new text format. In this format, words become the smallest information unit of a text, which makes segmentation unnecessary and avoids its adverse effects on Chinese Information Processing (CIP). The format encodes every word rather than every character. The problem of ambiguity is resolved by the encoding method, and the paper also presents a new idea for solving the unknown word problem using the text format based on word encoding. A keyword extraction experiment was conducted on the word-based platform using statistical analysis, and the experimental results are satisfactory.
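The abstract does not give the details of the encoding, so the following is only a minimal sketch of the general idea of storing a text as word codes rather than character codes. The class name `WordCodec`, its methods, the use of sequential integer codes, and the way unseen words simply receive the next free code are all assumptions made for illustration; they are not the paper's actual format.

```python
from typing import Dict, List


class WordCodec:
    """Toy word-level codec (hypothetical): one integer code per distinct word."""

    def __init__(self) -> None:
        self.word_to_code: Dict[str, int] = {}
        self.code_to_word: List[str] = []

    def code_for(self, word: str) -> int:
        # Assumption: a word not yet in the table is given the next free code.
        code = self.word_to_code.get(word)
        if code is None:
            code = len(self.code_to_word)
            self.word_to_code[word] = code
            self.code_to_word.append(word)
        return code

    def encode(self, words: List[str]) -> List[int]:
        # The text is stored as a sequence of word codes, so word
        # boundaries are part of the stored data.
        return [self.code_for(w) for w in words]

    def decode(self, codes: List[int]) -> List[str]:
        return [self.code_to_word[c] for c in codes]


codec = WordCodec()
# The words are supplied explicitly when the text is written,
# so no segmentation step is needed when the text is read back.
codes = codec.encode(["研究", "生命", "的", "起源"])
words = codec.decode(codes)
print(codes)
print("/".join(words))
```

Because the stored units are words, a reader of the encoded text recovers 研究/生命/的/起源 directly and never has to choose between that reading and an alternative split such as 研究生/命/的/起源, which is the kind of ambiguity a character-level format leaves to a segmenter.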