Abstract: Collocation is a lexical phenomenon in which two or more words are habitually combined as a conventional way of saying things. Collocation information is essential to many natural language processing tasks such as word sense disambiguation, machine translation, and information extraction. Most current work on collocation extraction is statistically based and has limited precision and recall, because it cannot reliably distinguish word co-occurrences, which are merely statistically significant, from true collocations, which are of habitual use and are thus either syntactically or semantically significant. The objective of this study is to investigate methods to improve the performance of collocation extraction algorithms. Different types of collocations are identified. Collocation extraction algorithms are then designed to target the different types of collocations using the features and criteria associated with each type. In addition to improving statistically based collocation extraction algorithms, syntactic and semantic information is incorporated into the algorithm to further improve the performance of collocation extraction. In the study of statistically based algorithms, a new algorithm based on bi-directional word bi-grams is designed to help identify collocations that have low co-occurrence frequency but are of fixed use. A large-scale collocation answer set is established so that collocation extraction algorithms can be evaluated and compared objectively using the same training corpus and the corresponding answer set. Collocations are then categorized into four types based on their compositionality, substitutability, and modifiability.
Based on the characteristics of each type of collocation, a multi-stage window-based collocation extraction system is built in which the n-gram collocations and the different types of bi-gram collocations are extracted separately in different stages, using different strategies and different discriminative features. A shallow treebank, referred to as the PolyU Treebank, is annotated manually to provide syntactic and semantic knowledge to further help collocation extraction. This treebank is also used to train a chunker based on a lexicalized Hidden Markov Model (HMM). The chunker provides a way to process running text for collocation extraction. By using the support collocation patterns and reject collocation patterns extracted from the annotated treebank and the parsed running text, syntactic features are employed to further improve the performance of the window-based collocation extraction system. Experimental results show that the use of syntactic patterns can significantly improve the performance of collocation extraction, especially in filtering out pseudo collocations. The extracted collocations were applied in the post-processing stage of a handwritten Chinese character recognition system. Experiments indicate that collocation information can be used in real applications to improve the performance of such natural-language-related applications. It should be pointed out that this work focuses on collocation extraction from Chinese text. However, the techniques developed are applicable to other languages, although separate annotation and an understanding of the syntactic and semantic knowledge of each language are needed.
Keywords: Natural language processing, collocation extraction, treebank, chunking and parsing.
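As an illustration of what a statistically based, window-based extraction step can look like, here is a minimal Python sketch. It is not the algorithm developed in this work: the abstract does not give the scoring formula, so pointwise mutual information (PMI) is swapped in as a common association measure, and the function names, the window size, and the `min_count` threshold are all assumptions made for the example. The input is assumed to be already word-segmented, which for Chinese text would require a separate segmentation step.

```python
import math
from collections import Counter


def count_cooccurrences(sentences, window=5):
    """Count words and ordered word pairs that co-occur within a window.

    `sentences` is an iterable of token lists (assumed pre-segmented).
    A pair (head, co_word) is counted when co_word appears at most
    `window` positions to the right of head in the same sentence.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, head in enumerate(tokens):
            for co_word in tokens[i + 1 : i + 1 + window]:
                pair_counts[(head, co_word)] += 1
    return word_counts, pair_counts


def score_candidates(word_counts, pair_counts, min_count=2):
    """Rank candidate pairs by PMI (an assumed, illustrative measure).

    Pairs seen fewer than `min_count` times are dropped. Returns a list
    of ((head, co_word), pair_count, pmi) tuples, highest PMI first.
    """
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scored = []
    for (head, co_word), n in pair_counts.items():
        if n < min_count:
            continue
        p_pair = n / total_pairs
        p_head = word_counts[head] / total_words
        p_co = word_counts[co_word] / total_words
        pmi = math.log2(p_pair / (p_head * p_co))
        scored.append(((head, co_word), n, pmi))
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

Calling `score_candidates(*count_cooccurrences(corpus))` on a list of token lists returns ranked candidate pairs. A purely statistical score like this one cannot separate frequent co-occurrences from true collocations, which is the limitation that the syntactic support and reject patterns described above are meant to address.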

A corpus-based approach to English subordinate clause identification

Syntactic Complexity Development in the Writings of EFL Learners: Insights from a Dependency Syntactically-Annotated Corpus

A Method of Automatic Recognition of Attributive Clauses in Chinese Language

English BNP Identification Based on Corpus-Trained Decision Tree

A Machine Learning Approach to Determine Semantic Dependency Structure in Chinese.

The Study on Automatic Chinese Collocation Extraction

Complete Syntactic Analysis Bases on Multi-level Chunking

A Study of Translation Rule Classification for Syntax-Based Statistical Machine Translation

An Automatic Chinese Collocation Extraction Algorithm Based on Lexical Statistics

Chinese Complex Long Sentences Processing Method for Chinese-Japanese Machine Translation

Combining Data-Driven Constituent and Dependency Parsers for CIPS-ParsEval-2009

English-Chinese Corpus Collection and Artificial Intelligence Translation Based on Dynamic Clustering Model

Exploiting Clause Boundary Information As Features For Chinese Functional Chunk Parsing

Chinese Textual Entailment Recognition Based on Syntactic Tree Clipping

A multi-stage chinese collocation extraction system

An improved method for finding bilingual collocation correspondences from monolingual corpora

Automatically Determining Semantic Relations in Chinese Sentences

Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging.

A Cascaded Syntactic and Semantic Dependency Parsing System.

Towards Accurate and Efficient Chinese Part-of-Speech Tagging.

Parsing Penn Chinese Treebank Based on Lexicalized Model