Chinese Multi-word Chunks Extraction for Computer Aided Translation

Baobao Chang
2007-01-01
Abstract: This paper proposes a methodology for extracting multi-word chunks for translation purposes. Our basic idea is to use a hybrid method that combines statistical measures with linguistic rules. The extraction system operates in four steps: (1) tokenization of the Chinese corpus; (2) extraction of multi-word chunks (2-grams to 10-grams) using Nagao's algorithm and a substring reduction algorithm; (3) statistical filtering that combines mutual information (or log-likelihood ratio) with left/right entropy; (4) linguistic filtering by chunk formation rules and a stop-word list. The hybrid method proved suitable for selecting multi-word chunks: it considerably improved extraction precision, which is much higher than that of a purely statistical method. We believe that multi-word chunks extracted in this way could be used effectively to supplement existing translation memory databases.
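The statistical filtering in step (3) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy corpus, and the thresholds are assumptions made for illustration, and the n-gram counter is a naive stand-in for Nagao's algorithm. It scores a two-word candidate by pointwise mutual information and by the entropy of the words seen immediately to its left and right, on an already tokenized corpus.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n=2):
    """Count all n-grams up to max_n (naive stand-in for Nagao's algorithm)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def pmi(bigram, counts, total):
    """Pointwise mutual information of a two-word candidate.

    Assumption: every probability is normalized by the token count `total`,
    a common simplification.
    """
    x, y = bigram
    p_xy = counts[bigram] / total
    p_x = counts[(x,)] / total
    p_y = counts[(y,)] / total
    return math.log2(p_xy / (p_x * p_y))


def entropy(neighbours):
    """Shannon entropy (in bits) of a Counter of neighbouring words."""
    total = sum(neighbours.values())
    return sum((c / total) * math.log2(total / c) for c in neighbours.values())


def left_right_entropy(tokens, chunk):
    """Entropy of the words immediately left and right of each chunk occurrence."""
    n = len(chunk)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == chunk:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[tokens[i + n]] += 1
    return entropy(left), entropy(right)


def keep_candidate(tokens, chunk, counts, min_pmi=1.0, min_entropy=0.5):
    """Hypothetical filter: thresholds are illustrative, not from the paper."""
    le, re_ = left_right_entropy(tokens, chunk)
    return pmi(chunk, counts, len(tokens)) >= min_pmi and min(le, re_) >= min_entropy


# Toy tokenized corpus (invented for this example).
tokens = "机器 翻译 系统 使用 机器 翻译 技术 改进 机器 翻译 质量".split()
counts = ngram_counts(tokens)
print(keep_candidate(tokens, ("机器", "翻译"), counts))
```

High mutual information suggests the words inside a candidate are strongly associated, while high left/right entropy suggests the candidate appears in varied contexts and is therefore likely a complete unit rather than a fragment of a longer chunk. Candidates passing both checks would then go on to the linguistic filter in step (4).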