Building of the Extremely Large Scale Words Collocation Corpus and Its Composition Analysis

Xu Runhua, Chen Xiaohe
DOI: https://doi.org/10.3969/j.issn.1008-9853.2011.03.006
2011-01-01
Abstract: Many areas of natural language processing urgently need a large-scale word collocation corpus. Using three syntactic parsers developed respectively by Harbin Institute of Technology, UC Berkeley, and Stanford University, this paper parses nine years of People's Daily text. The three sets of parsing results are merged to obtain collocation candidates, and parameter settings and optimization are then applied to further improve collocation accuracy. The result is a database of about 1.36 million collocation patterns with associated statistical information, from which a word collocation corpus is built. The database covers 6 common types of collocation and ensures a fairly good precision rate. It may provide reliable data support for other related work.
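The abstract does not say how the three parsers' outputs are merged or which parameters are tuned. As a rough illustration only, the sketch below assumes that merging is done by majority agreement and that filtering uses a corpus frequency threshold. The names `merge_parses`, `filter_candidates`, `min_votes`, and `min_freq`, the relation label `VO`, and the sample data are all hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# A collocation candidate: (head word, dependent word, relation type).
Triple = Tuple[str, str, str]


def merge_parses(sentences: Iterable[List[Set[Triple]]],
                 min_votes: int = 2) -> Counter:
    """Count, over all sentences, the triples that at least
    `min_votes` parsers agree on (assumed merging rule)."""
    counts: Counter = Counter()
    for parser_outputs in sentences:
        votes: Counter = Counter()
        for triples in parser_outputs:
            votes.update(set(triples))  # each parser votes once per triple
        for triple, n in votes.items():
            if n >= min_votes:
                counts[triple] += 1
    return counts


def filter_candidates(counts: Counter, min_freq: int = 2) -> Dict[Triple, int]:
    """Keep candidates seen in at least `min_freq` sentences
    (assumed stand-in for the paper's parameter tuning)."""
    return {t: c for t, c in counts.items() if c >= min_freq}


# Hypothetical toy input: two sentences, three parser outputs each.
sample = [
    [
        {("improve", "level", "VO"), ("level", "living", "ATT")},
        {("improve", "level", "VO")},
        {("improve", "level", "VO"), ("improve", "living", "VO")},
    ],
    [
        {("improve", "level", "VO")},
        {("improve", "level", "VO")},
        {("raise", "level", "VO")},
    ],
]

merged = merge_parses(sample)
kept = filter_candidates(merged)
```

In this toy input only the triple that at least two parsers produce in both sentences survives both steps. A stricter `min_votes=3` would trade recall for precision, which is the kind of trade-off the parameter tuning mentioned in the abstract presumably addresses.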