Vietnamese Word Segmentation with CRFs and SVMs: An Investigation

Cam-Tu Nguyen, Trung-Kien Nguyen, Xuan Hieu Phan, Le-Minh Nguyen, Quang-Thuy Ha
2006-01-01
Abstract: Word segmentation for Vietnamese, as for most Asian languages, is an important task that has a significant impact on higher levels of language processing. However, it has received little attention from the community because of the lack of a common annotated corpus for evaluation and comparison. Also, most previous studies focused on unsupervised statistical approaches or combined too many techniques. Consequently, their accuracies are not as high as expected. This paper reports a careful investigation of using conditional random fields (CRFs) and support vector machines (SVMs), two of the most successful statistical learning methods in NLP and pattern recognition, for solving the task. We first build a moderately sized annotated corpus from different sources of material. For a careful evaluation, different CRF and SVM models using different feature settings were trained, and their results are compared and contrasted with each other. In addition, we discuss several important points about the accuracy, computational cost, corpus size, and other aspects that might influence the overall quality of Vietnamese word segmentation.