Abstract: Overlapping ambiguity is a major type of ambiguity in Chinese word segmentation. This paper studies the statistical properties of overlapping ambiguities in depth, based on observations from a very large, balanced, general-purpose Chinese corpus, and reports the relevant statistics from several perspectives. The stability of high-frequency maximal overlapping ambiguities is tested using statistical observations from both the general-purpose corpus and domain-specific corpora. On this basis, a disambiguation strategy is proposed that assigns a predefined solution to each of the 5,507 pseudo overlapping ambiguities, suggesting that over 42% of the overlapping ambiguities in Chinese running text could be resolved without any error. Several state-of-the-art word segmenters are compared on these overlapping ambiguities. Preliminary experiments show that about 2% of the 5,507 pseudo ambiguities are mistakenly segmented by these segmenters, and that these can be handled properly by the proposed strategy.
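The lookup idea behind the strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon, the table entry, and the function names `find_overlaps` and `resolve` are hypothetical, and the detector only finds simple two-word overlaps (a string ABC in which both AB and BC are lexicon words) rather than the maximal overlapping ambiguities studied in the paper.

```python
# Toy lexicon and lookup table; both are illustrative assumptions,
# not data taken from the paper.
LEXICON = {"为人", "人民", "服务"}

# Hypothetical table of pseudo overlapping ambiguities: each string is
# mapped to the single segmentation it is assumed to always take.
PSEUDO_TABLE = {"为人民": ["为", "人民"]}


def find_overlaps(text, lexicon, max_len=4):
    """Return (start, end) spans where text[start:end] has the form ABC,
    with AB and BC both in the lexicon and A, B, C all non-empty."""
    spans = []
    n = len(text)
    for i in range(n):
        # Candidate first word AB = text[i:j], at least two characters.
        for j in range(i + 2, min(n, i + max_len) + 1):
            if text[i:j] not in lexicon:
                continue
            # Candidate second word BC = text[k:m] starts inside AB
            # and ends after it.
            for k in range(i + 1, j):
                for m in range(j + 1, min(n, k + max_len) + 1):
                    if text[k:m] in lexicon:
                        spans.append((i, m))
    return spans


def resolve(ambiguity, table=PSEUDO_TABLE):
    """Return the predefined segmentation, or None if the string is not
    in the table and must be left to a context-sensitive method."""
    return table.get(ambiguity)


text = "为人民服务"
spans = find_overlaps(text, LEXICON)
segmentations = [resolve(text[s:e]) for s, e in spans]
```

Under these assumptions, strings found in the table are segmented by lookup alone, while any other overlapping ambiguity returns `None` and would still need a context-based disambiguation step.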

A Disambiguation Method for Segmenting Chinese Words by Using Relaxation Algorithm Based on Context

Chinese Word Segmentation Method Based on Dictionary and Frequency of the Words

Algorithm for Solving 3-Character Crossing Ambiguities in Chinese Word Segmentation

Ambiguity Identification Strategy of Chinese Word Segmentation

Resolution to Combinational Ambiguity of Chinese Word Segmentation

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

Chinese Word Segmentation with Character Abstraction

A search-based Chinese word segmentation method

A Hybrid Approach to the Real World Text Segmentation

A Compression-based Algorithm for Chinese Word Segmentation

Some Problems of Chinese Segmentation

Algorithm Based on SVM and Rules for the Disambiguation of Combinatorial Ambiguous Phrases

Resolution to Chinese Combinational Ambiguity Combined Corpus-Based Method with Linguistics Knowledge

A Pragmatic Approach to Increase Accuracy of Chinese Word-Segmentation

A Statistical Approach For Resolving Problematical Word Boundaries In Chinese Lexicography

Increasing Accuracy of Chinese Segmentation with Strategy of Multi-step Processing

Research of Reverse Backtracking Matching Algorithm for Chinese Word Segmentation

Covering ambiguity resolution in Chinese word segmentation based on contextual information

A Unicode Based Adaptive Segmentor

Research on the Application of a Chinese Semantic Knowledge Base in Chinese Phrase Disambiguation

A Hybrid Method For Syntactic And Semantic Structure Disambiguation For Chinese