Abstract: Statistical bilingual word alignment has been well studied in the context of machine translation. This paper adapts the bilingual word alignment algorithm to a monolingual scenario in order to extract collocations from a monolingual corpus. The monolingual corpus is first replicated to generate a parallel corpus, in which each sentence pair consists of two identical sentences in the same language. A monolingual word alignment algorithm is then employed to align potentially collocated words within each sentence. Finally, the aligned word pairs are ranked according to refined alignment probabilities, and those with higher scores are extracted as collocations. We conducted experiments on Chinese and English corpora separately. Compared with previous approaches, which use association measures to extract collocations from word pairs co-occurring within a given window, our method achieves higher precision and recall. In human evaluation, our method achieves absolute precision improvements of 27.9% on the Chinese corpus and 23.6% on the English corpus. In particular, it can extract collocations with longer spans, achieving a high precision of 69% on long-span (>6) Chinese collocations.
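The three steps in the abstract (replicate the corpus, align each sentence with its copy, rank the aligned pairs) can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: the aligner is a simplified IBM Model 1-style EM, a word is not allowed to align to its own position in the copy, and the "refined" score is taken to be the average of the two directional probabilities. The function names `train_monolingual_model1` and `rank_collocations` and the toy corpus are hypothetical, not the authors' implementation.

```python
from collections import defaultdict
from itertools import product


def train_monolingual_model1(sentences, iterations=10):
    """Estimate t[(w, v)] ~ p(w | v) with IBM Model 1-style EM.

    Each sentence is paired with an identical copy of itself.
    Assumption: position i may not align to position i in the copy.
    """
    # Start uniform over word pairs that co-occur at different positions.
    t = defaultdict(float)
    for sent in sentences:
        for i, j in product(range(len(sent)), repeat=2):
            if i != j:
                t[(sent[i], sent[j])] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each word's alignment mass over the other positions.
        for sent in sentences:
            for i, w in enumerate(sent):
                candidates = [v for j, v in enumerate(sent) if j != i]
                z = sum(t[(w, v)] for v in candidates)
                if z == 0:
                    continue  # one-word sentence: nothing to align to
                for v in candidates:
                    frac = t[(w, v)] / z
                    count[(w, v)] += frac
                    total[v] += frac
        # M-step: renormalise the expected counts per conditioning word.
        t = defaultdict(float, {(w, v): c / total[v] for (w, v), c in count.items()})
    return t


def rank_collocations(t):
    """Score unordered pairs by the mean of p(w | v) and p(v | w) (assumed refinement)."""
    scores = {}
    for (w, v) in t:
        if w == v:
            continue  # skip a word paired with another occurrence of itself
        key = tuple(sorted((w, v)))
        scores[key] = (t.get((w, v), 0.0) + t.get((v, w), 0.0)) / 2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus, invented for illustration only.
corpus = [
    ["heavy", "rain", "fell", "today"],
    ["heavy", "rain", "is", "expected"],
    ["the", "rain", "fell"],
    ["heavy", "traffic", "today"],
]
table = train_monolingual_model1(corpus)
ranked = rank_collocations(table)
for pair, score in ranked[:5]:
    print(pair, round(score, 3))
```

Because candidate pairs come from alignment links rather than a fixed co-occurrence window, a sketch of this kind places no upper bound on the distance between the two words, which is in line with the abstract's claim about long-span collocations.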

An Automatic Chinese Collocation Extraction Algorithm Based on Lexical Statistics

Extracting terminologically relevant collocations in the translation of Chinese monograph

A CRF-based Method for Automatic Construction of Chinese Symptom Lexicon

Large-scale Automatic Extraction of Chinese Compound Lexical Cohesion Pairs

Chinese Partial Parser for Automatic Extraction of Verb Grammatical Collocations

Collocation Extraction Using Monolingual Word Alignment Method

Research on Collocation Extraction Based on Syntactic and Semantic Dependency Analysis

Automatic keyphrase extraction from Chinese news documents

Chinese Keyword Extraction Based on N-Gram and Word Co-occurrence

Automatic Extraction of Multiword Expressions Combining Statistical and Similarity Approaches

Automatic summarization oriented Chinese word extraction and statistics system

Automatic Extraction of Lexical Relations from Chinese Machine Readable Dictionary

Association Measures for Collocation Extraction

Construction and Application of Chinese Generation Lexicon for Chinese Irregular Collocation Between Verbs and Nouns

Automatic Construction of Chinese Stop Word List

Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

Word extraction based on semantic constraints in Chinese word-formation

Parsing-based Automatic Chinese Term Extraction

Exploiting Lexicalized Statistical Patterns in Chinese Linguistic Analysis

Chinese Multi-word Chunks Extraction for Computer Aided Translation

Query Based Chinese Phrase Extraction for Site Search