Study on Statistical Methods for Automatic Collocation Extraction from Large-Scale Corpus

YAO Jian-min, QU Yun-qian, ZHU Qiao-ming, ZHANG Jing
DOI: https://doi.org/10.3969/j.issn.1000-7024.2007.09.053
2007-01-01
Abstract: Collocations are of great importance in dictionary compilation and natural language processing, and collocation extraction is one of the principal applications of corpus linguistics. This paper studies the automatic extraction of bigrams as candidate collocations from the Penn Treebank, using log-likelihood, chi-square, and mutual information as association measures. The experimental results show that the statistical methods are feasible. However, the collocations extracted differ in character across the three criteria, because each criterion makes different distributional assumptions.
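To make the three association measures concrete, the following is a minimal Python sketch of how adjacent word pairs might be scored from a 2x2 contingency table. It is an illustration under stated assumptions, not the authors' implementation: the function name `bigram_scores`, the toy sentence, and the use of pointwise mutual information with base-2 logarithms are all hypothetical choices, and the abstract does not specify the preprocessing, frequency thresholds, or exact formula variants used on the Penn Treebank.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score every adjacent word pair with PMI, chi-square and log-likelihood (G^2).

    Illustrative sketch only; names and formula variants are assumptions.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())  # total number of bigram tokens

    # Marginal counts: how often a word fills the first / second slot.
    left, right = Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        left[w1] += count
        right[w2] += count

    scores = {}
    for (w1, w2), o11 in bigrams.items():
        r1, c1 = left[w1], right[w2]   # row 1 and column 1 totals
        r2, c2 = n - r1, n - c1        # row 2 and column 2 totals
        o12 = r1 - o11                 # w1 followed by a word other than w2
        o21 = c1 - o11                 # a word other than w1 followed by w2
        o22 = n - o11 - o12 - o21      # neither w1 first nor w2 second

        observed = (o11, o12, o21, o22)
        # Expected counts if the two slots were independent.
        expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

        # Pointwise mutual information: log2 of observed over expected.
        pmi = math.log2(o11 / expected[0])

        # Pearson chi-square for a 2x2 table (0.0 if a marginal is empty).
        denom = r1 * r2 * c1 * c2
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

        # Log-likelihood ratio; cells with zero observations contribute 0.
        g2 = 2.0 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0)

        scores[(w1, w2)] = {"pmi": pmi, "chi2": chi2, "g2": g2}
    return scores


if __name__ == "__main__":
    # Toy input, not Penn Treebank data.
    text = "new york is a big city and new york is a busy city"
    ranked = sorted(bigram_scores(text.split()).items(),
                    key=lambda item: item[1]["g2"], reverse=True)
    for pair, s in ranked[:3]:
        print(pair, {k: round(v, 3) for k, v in s.items()})
```

Ranking the same candidates by each score in turn is one way to observe the differences the abstract reports: mutual information tends to favour rare pairs, while chi-square and log-likelihood weigh the whole contingency table and so respond differently to frequency.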