Active Learning in Chinese Word Segmentation Based on Multigram Language Model

FENG Chong, CHEN Zhao-xiong, HUANG He-yan, GUAN Zhen-zhen
DOI: https://doi.org/10.3969/j.issn.1003-0077.2006.01.008
2006-01-01
Abstract: Word segmentation is a fundamental task in Chinese language processing. To address the difficulty traditional methods have in coping with diverse application domains and evolving language phenomena, this paper adopts an unsupervised learning framework, using the EM algorithm to train an n-multigram language model. A new certainty-based active learning segmentation algorithm is proposed, which combines labeled and unlabeled data to optimize the language model. In experiments, it outperforms other unsupervised word segmentation algorithms.
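The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a multigram model in which each sentence is a sequence of independent chunks of at most MAX_LEN characters, trains chunk probabilities with one forward-backward EM step at a time, and scores each sentence by the ratio of its best segmentation's probability to the total probability of all segmentations. That ratio, the names MAX_LEN, em_step, certainty, and select_for_labeling, and the choice to send the least certain sentences to an annotator are all assumptions made for illustration.

```python
from collections import defaultdict

MAX_LEN = 3  # assumed maximum chunk (multigram) length, in characters


def init_probs(corpus, max_len=MAX_LEN):
    """Start from relative frequencies of all substrings up to max_len."""
    counts = defaultdict(float)
    for sent in corpus:
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                counts[sent[i:j]] += 1.0
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def forward(sent, probs, max_len=MAX_LEN):
    """alpha[j] = total probability of all segmentations of sent[:j]."""
    alpha = [0.0] * (len(sent) + 1)
    alpha[0] = 1.0
    for j in range(1, len(sent) + 1):
        for i in range(max(0, j - max_len), j):
            alpha[j] += alpha[i] * probs.get(sent[i:j], 0.0)
    return alpha


def backward(sent, probs, max_len=MAX_LEN):
    """beta[i] = total probability of all segmentations of sent[i:]."""
    n = len(sent)
    beta = [0.0] * (n + 1)
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(i + max_len, n) + 1):
            beta[i] += probs.get(sent[i:j], 0.0) * beta[j]
    return beta


def em_step(corpus, probs, max_len=MAX_LEN):
    """One EM iteration: expected chunk counts (E), then renormalize (M)."""
    expected = defaultdict(float)
    for sent in corpus:
        alpha = forward(sent, probs, max_len)
        beta = backward(sent, probs, max_len)
        z = alpha[len(sent)]
        if z == 0.0:
            continue  # sentence has no segmentation under the current model
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                w = sent[i:j]
                expected[w] += alpha[i] * probs.get(w, 0.0) * beta[j] / z
    total = sum(expected.values())
    return {w: c / total for w, c in expected.items() if c > 0.0}


def viterbi(sent, probs, max_len=MAX_LEN):
    """Most probable segmentation and its probability."""
    n = len(sent)
    best = [0.0] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            p = best[i] * probs.get(sent[i:j], 0.0)
            if p > best[j]:
                best[j], back[j] = p, i
    words, j = [], n
    while j > 0:  # follow back-pointers from the end of the sentence
        words.append(sent[back[j]:j])
        j = back[j]
    return words[::-1], best[n]


def certainty(sent, probs, max_len=MAX_LEN):
    """Assumed certainty score: P(best segmentation) / P(all segmentations)."""
    _, p_best = viterbi(sent, probs, max_len)
    z = forward(sent, probs, max_len)[len(sent)]
    return p_best / z if z > 0.0 else 0.0


def select_for_labeling(corpus, probs, k, max_len=MAX_LEN):
    """Return the k sentences the model is least certain about."""
    return sorted(corpus, key=lambda s: certainty(s, probs, max_len))[:k]
```

In an active learning loop built on this sketch, the sentences returned by select_for_labeling would be segmented by hand and their chunk counts merged into the model before the next round of EM; how the paper actually combines labeled and unlabeled data is not stated in the abstract.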