Resolution of Overlapping Ambiguity Strings Based on Smoothed Maximum Entropy Model with Character Feature
任惠,林鸿飞,杨志豪
DOI: https://doi.org/10.3969/j.issn.1003-0077.2010.04.003
2010-01-01
Abstract: Overlapping ambiguity strings (OAS) are one of the main difficulties in automatic Chinese word segmentation. This paper treats OAS resolution as a classification task and solves it with a maximum entropy model that integrates character features. To overcome data sparseness in maximum entropy modeling, the paper introduces inequality smoothing and Gaussian smoothing. We compare Gaussian smoothing, inequality smoothing, and frequency discounting on the four datasets of the Second International Chinese Word Segmentation Bakeoff, showing that Gaussian smoothing and inequality smoothing perform much better than the discount method, while inequality smoothing also integrates feature selection seamlessly into parameter estimation, yielding a significantly compressed model. On the four datasets, the disambiguation precision of the proposed method reaches 96.27%, 96.83%, 96.56%, and 96.52% respectively, with a relative improvement of 5.87%, 5.64%, 5.00%, and 5.00% from the rich features and a relative improvement of 5.87%, 5.64%, 5.00%, and 5.00% from the smoothing techniques. Meanwhile, inequality smoothing compresses the classification models by 38.7, 19.9, 44.6, and 9.7 respectively.
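The classification view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature template (`char_features`), the two class labels, the toy training examples, and all hyperparameters are assumptions made for illustration only. The sketch covers Gaussian smoothing alone, implemented as a Gaussian prior on the weights of a conditional maximum entropy model trained by plain gradient ascent; inequality smoothing is not shown.

```python
import math
from collections import defaultdict

def char_features(left, oas, right):
    """Hypothetical character feature template: context characters plus each OAS character."""
    feats = [f"L={left}", f"R={right}"]
    feats += [f"C{i}={ch}" for i, ch in enumerate(oas)]
    return feats

def predict_proba(w, classes, feats):
    """Conditional maximum entropy distribution p(class | features)."""
    scores = {c: sum(w.get((f, c), 0.0) for f in feats) for c in classes}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(samples, labels, sigma2=1.0, lr=0.5, epochs=200):
    """Gradient ascent on the log-likelihood minus a Gaussian penalty sum(w^2) / (2 * sigma2)."""
    classes = sorted(set(labels))
    w = defaultdict(float)  # one weight per (feature, class) pair
    n = len(samples)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, y in zip(samples, labels):
            probs = predict_proba(w, classes, feats)
            for c in classes:
                for f in feats:
                    # observed feature count minus expected feature count
                    grad[(f, c)] += (1.0 if c == y else 0.0) - probs[c]
        for key, g in grad.items():
            # the Gaussian prior pulls every weight toward zero
            w[key] += lr * (g - w[key] / sigma2) / n
    return w, classes

def predict(w, classes, feats):
    """Return the most probable segmentation class."""
    probs = predict_proba(w, classes, feats)
    return max(classes, key=lambda c: probs[c])

# Toy data (invented for illustration): the OAS "从小学" can be split as
# "从小|学" (class "AB|C") or "从|小学" (class "A|BC") depending on context.
toy = [
    ("他", "从小学", "钢", "AB|C"),
    ("我", "从小学", "毕", "A|BC"),
    ("她", "从小学", "画", "AB|C"),
    ("他", "从小学", "到", "A|BC"),
]
samples = [char_features(l, s, r) for l, s, r, _ in toy]
labels = [y for _, _, _, y in toy]

weights, classes = train_maxent(samples, labels, sigma2=1.0)
print(predict(weights, classes, char_features("你", "从小学", "毕")))
```

A smaller `sigma2` corresponds to a stronger prior and therefore smaller weights, which is how this form of smoothing limits overfitting to features seen only a few times.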