Abstract: The usefulness of a statistical approach suggested by Church et al. (1991) is evaluated for the extraction of verb-noun (V-N) collocations from German text corpora. Some problematic issues of that method arising from properties of the German language are discussed, and various modifications of the method are considered that might improve extraction results for German. The precision and recall of all variant methods are evaluated for V-N collocations containing support verbs, and the consequences for further work on the extraction of collocations from German corpora are discussed. With a sufficiently large corpus (>= 6 million word tokens), the average error rate can be reduced to 2.2% (97.8% precision) with the most restrictive method, but at the cost of losing almost 50% of the data compared to a less restrictive method that still achieves 87.6% precision. Depending on the goal, emphasis can be put on high recall for lexicographic purposes or on high precision for automatic lexical acquisition; in each case, improving one measure lowers the other. Low recall can still be acceptable if very large corpora (i.e. 50-100 million words) are available or if corpora for special domains are used in addition to the data found in machine-readable (collocation) dictionaries.
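The trade-off between precision and recall mentioned in the abstract can be illustrated with a minimal sketch. The function name `precision_recall` and the toy V-N pairs below are hypothetical and are not taken from the study; they only show how a more restrictive extraction can raise precision while lowering recall.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted candidates against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted candidates that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard collocations that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard of support-verb collocations (verb, noun).
gold = {
    ("treffen", "Entscheidung"),
    ("stellen", "Frage"),
    ("nehmen", "Abschied"),
    ("leisten", "Hilfe"),
}

# Hypothetical output of a less restrictive method: more hits, one error.
permissive = {
    ("treffen", "Entscheidung"),
    ("stellen", "Frage"),
    ("nehmen", "Abschied"),
    ("kaufen", "Brot"),
}

# Hypothetical output of a more restrictive method: no errors, fewer hits.
restrictive = {
    ("treffen", "Entscheidung"),
    ("stellen", "Frage"),
}

print(precision_recall(permissive, gold))
print(precision_recall(restrictive, gold))
```

In this toy setting the restrictive set contains no wrong pairs but recovers only half of the gold collocations, which mirrors the kind of loss in data the abstract reports for its most restrictive method.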

Study on Statistical Methods for Automatic Collocation Extraction from Large-Scale Corpus

A Study on the Influence of Computer Corpus Software on College Students' English Vocabulary Learning.

Association Measures for Collocation Extraction

Collocation Extraction Using Monolingual Word Alignment Method.

Research on Collocation Extraction Based on Syntactic and Semantic Dependency Analysis.

Extracting terminologically relevant collocations in the translation of Chinese monograph

Collocation Use in EFL Learners’ Writing Across Multiple Language Proficiencies: A Corpus-Driven Study

Improving Statistical Machine Translation with monolingual collocation

Large-scale Automatic Extraction of Chinese Compound Lexical Cohesion Pairs

Extraction of V-N-Collocations from Text Corpora: A Feasibility Study for German

Study of Automatic Abstracting Based on Corpus and Hierarchical Dictionary

Chinese Partial Parser for Automatic Extraction of Verb Grammatical Collocations

The Methods of Machine Learning for Natural Language Information Extraction

Corpus-based Study on the Differences of Verb-object-noun Collocations

Analysis of Parts-of-speech Correspondence Between DCC and GKB

Quality Assurance Of Automatic Annotation Of Very Large Corpora: A Study Based On Heterogeneous Tagging Systems

Research on Corpus Annotation Method Based on Collective Intelligence

Automatic Extraction of Multiword Expressions Combining Statistical and Similarity Approaches

Automatic Collecting of Text Data for Cantonese Language Modeling

Statistical Analyses on Chinese Ancient Books for Information Retrieval

A large synchronous corpus as monitoring corpus: Some comparative content analysis of Chinese and Japanese language developments