Improving Chinese Word Segmentation with Description Length Gain

Chunyu Kit, Hai Zhao
2007-01-01
Abstract: Supervised and unsupervised learning have seldom been combined to lend strength to each other in the field of Chinese word segmentation (CWS). This paper presents a novel approach to CWS that uses description length gain (DLG), an empirical goodness measure for unsupervised word discovery, to enhance the segmentation performance of conditional random field (CRF) learning. Specifically, we attempt to integrate the lexical information acquired from unsupervised DLG segmentation into the supervised CRF learning of character tagging for CWS. Our experimental results show that CRF learning can be further improved on top of its state-of-the-art performance in the field by making good use of DLG information.

Keywords: Chinese word segmentation, description length gain, conditional random fields

1 Introduction

The task of Chinese word segmentation (CWS) is to segment an input text into words. It is a special case of tokenization in natural language processing (NLP), shared by many other languages that have no explicit word delimiters such as spaces. Researchers in the field have been pursuing various machine learning approaches for further performance enhancement since Bakeoff-2003