Research on Deep Processing Technologies for Large-Scale Corpora

QU Wei-guang, TANG Xu-ri, YU Jing-song
2009-01-01
Suvremena lingvistika
Abstract: This paper first critically examines the existing automatic proofreading technologies used in processing Chinese characters. It draws a distinction between shallow tagging and deep tagging. Shallow tagging refers to the use of existing POS taggers to process texts without human correction of errors. Deep tagging, on the other hand, refers to automatic tagging methods that improve on the output of shallow tagging. In tests on the template corpora, the proposed technology detected and corrected more than 50,000 errors or inconsistencies in segmentation and POS tagging. The proposed disambiguation model, PFR-SUM (sum of relative frequency ratio of words in context), shows excellent classification performance: it detects a large number of errors in the template corpora and improves the efficiency of corpus proofreading. The model also performs well in resolving more than 400 types of common ambiguities when trained on the proofread template corpora and applied to large-scale corpora.