Automatic Detection of Collocation
Jiangsheng Yu, Zhihui Jin, Zhenshan Wen
2008-01-01
Abstract: Collocation is an important relation between words, with wide application in semantic parsing (e.g., word sense disambiguation), machine translation (e.g., automatic alignment of bilingual corpora), computational lexicons, etc. We first summarize, from a theoretical standpoint, four methods for detecting collocations: the likelihood interval, the likelihood ratio test, the u test, and the χ² test. We then use them to extract collocations automatically from a large-scale corpus. Through experiments (some results are listed in the appendix), the relationships among the statistical models are explored and analyzed. Directions for further research are discussed in the conclusion. The corpus we used is a half-year collection of People's Daily with segmentation and POS tagging, which contains at least 1,103,455 Chinese sentences.
Keywords: collocation, independence, hypothesis testing, likelihood interval, likelihood ratio, χ² test, normal distribution
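To make the independence tests named in the abstract concrete, the following is a minimal sketch of how three of them are commonly computed for a single word pair from corpus counts. It is not the authors' implementation: the abstract does not give their formulas, so the textbook 2x2 contingency-table forms are assumed here, the u test is taken to be the normal approximation to a binomial count under independence, and all function names and the example counts are hypothetical. The likelihood interval method is not covered.

```python
import math

def contingency(c12, c1, c2, n):
    """2x2 table for a pair (w1, w2): c12 = count of the pair,
    c1 = count of w1, c2 = count of w2, n = total number of pairs."""
    o11 = c12                  # w1 with w2
    o12 = c1 - c12             # w1 without w2
    o21 = c2 - c12             # w2 without w1
    o22 = n - c1 - c2 + c12    # neither word
    return o11, o12, o21, o22

def expected(table):
    """Expected cell counts under the independence hypothesis.
    Assumes every row and column total is nonzero."""
    o11, o12, o21, o22 = table
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    k1, k2 = o11 + o21, o12 + o22
    return r1 * k1 / n, r1 * k2 / n, r2 * k1 / n, r2 * k2 / n

def chi_square(table):
    """Pearson chi-square statistic (1 degree of freedom)."""
    return sum((o - e) ** 2 / e for o, e in zip(table, expected(table)))

def log_likelihood_ratio(table):
    """Likelihood ratio statistic G^2 = 2 * sum(o * ln(o / e)).
    Cells with an observed count of zero contribute nothing."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(table, expected(table)) if o > 0)

def u_statistic(c12, c1, c2, n):
    """Assumed form of the u test: the pair count is treated as
    Binomial(n, p) with p = (c1/n)*(c2/n), then standardized."""
    e = c1 * c2 / n            # expected pair count, n * p
    return (c12 - e) / math.sqrt(e * (1.0 - e / n))

if __name__ == "__main__":
    # Hypothetical counts, not taken from the People's Daily corpus.
    c12, c1, c2, n = 50, 100, 100, 1000
    table = contingency(c12, c1, c2, n)
    print("chi-square:", chi_square(table))
    print("G^2:", log_likelihood_ratio(table))
    print("u:", u_statistic(c12, c1, c2, n))
```

Under these assumptions, a pair is flagged as a candidate collocation when a statistic exceeds its critical value, for example 3.841 for the χ² and likelihood ratio statistics (one degree of freedom, 5% level) or 1.645 for a one-sided u test at the same level.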