A Unicode Based Adaptive Segmentor

Q. Lu, S. T. Chan, R. F. Xu, T. S. Chiu, B. L. Li, S. W. Yu
DOI: https://doi.org/10.3115/1119250.1119275
2004-01-01
Abstract: This paper presents a Unicode-based Chinese word segmentor that can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses a divide-and-conquer strategy: personal names, numbers, time expressions, numerical values, and similar items are recognized in the preprocessing stage. The segmentor then uses tagging information for disambiguation. The design is modular: each functional part is implemented as a separate module that tackles one problem at a time, which provides more flexibility and extensibility. Results show that adding preprocessing modules and accessorial modules increases the accuracy of the segmentor and that the system adapts easily to different applications.
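The modular, divide-and-conquer design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `regex_module`, `dictionary_module`, and `segment`, the regular expressions, the tags, and the use of forward maximum matching as the core segmentation step are all assumptions made for illustration. Each module claims the spans it recognizes and leaves the rest untagged for later modules.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

# A token is (text, tag); a tag of None means "not yet recognized".
Token = Tuple[str, Optional[str]]
Module = Callable[[List[Token]], List[Token]]


def regex_module(pattern: str, tag: str) -> Module:
    """Hypothetical preprocessing module: tag every match of `pattern`."""
    compiled = re.compile(pattern)

    def run(tokens: List[Token]) -> List[Token]:
        out: List[Token] = []
        for text, existing in tokens:
            if existing is not None:
                # Already claimed by an earlier module; pass through unchanged.
                out.append((text, existing))
                continue
            pos = 0
            for m in compiled.finditer(text):
                if m.start() > pos:
                    out.append((text[pos:m.start()], None))
                out.append((m.group(), tag))
                pos = m.end()
            if pos < len(text):
                out.append((text[pos:], None))
        return out

    return run


def dictionary_module(lexicon: Set[str], max_len: int = 4) -> Module:
    """Assumed stand-in for the core segmentor: forward maximum matching."""

    def run(tokens: List[Token]) -> List[Token]:
        out: List[Token] = []
        for text, existing in tokens:
            if existing is not None:
                out.append((text, existing))
                continue
            i = 0
            while i < len(text):
                # Try the longest candidate first; fall back to one character.
                for n in range(min(max_len, len(text) - i), 0, -1):
                    piece = text[i:i + n]
                    if n == 1 or piece in lexicon:
                        out.append((piece, "WORD"))
                        i += n
                        break
        return out

    return run


def segment(text: str, modules: List[Module]) -> List[Token]:
    """Run each module in order over the token list."""
    tokens: List[Token] = [(text, None)]
    for module in modules:
        tokens = module(tokens)
    return tokens


# Illustrative pipeline; patterns and lexicon are made up for this sketch.
# Python strings are Unicode, so Simplified and Traditional forms coexist.
PIPELINE: List[Module] = [
    regex_module(r"[0-9]+年", "TIME"),
    regex_module(r"[0-9]+", "NUM"),
    dictionary_module({"買了", "买了"}),
]

if __name__ == "__main__":
    # Mixed-mode input: 買 is Traditional, 书 is Simplified.
    print(segment("他2004年買了3本书", PIPELINE))
```

Because every module shares the same token-list interface, a recognizer (for example, one for personal names) could be added or removed by editing `PIPELINE` alone, which is one way to read the flexibility the abstract attributes to the modular design.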