Abstract: The lack of word delimiters such as spaces in Chinese texts makes word segmentation a special issue in Chinese text processing. As the volume of Chinese text on the Internet grows rapidly, the number of unknown words increases accordingly. Word segmentation approaches that rely solely on existing dictionaries cannot handle these unknown words. In this paper, we propose a novel unsupervised method to segment large Chinese corpora using contextual information. In particular, the number of distinct characters preceding and following a string, known as the accessors of the string, is used to measure the independence of the string. The greater the independence, the more likely it is that the string is a word. Segmentation is then treated as an optimization problem: maximize a target function of this number over all word candidates in an utterance. Our purpose here is to find the function that gives the best segmentation performance. Performance is evaluated with word token recall in addition to word type precision and word type recall. Among the three types of target functions that we have explored, polynomial functions outperform the others. This simple method is effective for unsupervised segmentation of Chinese texts, and its performance is comparable to that of other recently reported unsupervised segmentation methods.
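
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names (`accessor_variety`, `score`, `segment`), taking the minimum of the left and right accessor counts, counting each utterance boundary as its own accessor, the candidate length limit `max_len`, the constant score for single characters, and the polynomial form `len(word)**c * av**d` with its exponents `c` and `d`. The abstract only states that polynomial target functions performed best; it does not give their exact form.

```python
from collections import defaultdict


def accessor_variety(utterances, max_len=4):
    """Count distinct left/right neighbours for every substring up to max_len."""
    left = defaultdict(set)
    right = defaultdict(set)
    for u_idx, u in enumerate(utterances):
        n = len(u)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = u[i:j]
                # Assumption: an utterance boundary counts as an accessor
                # that is distinct for each utterance.
                left[s].add(u[i - 1] if i > 0 else ("<s>", u_idx))
                right[s].add(u[j] if j < n else ("</s>", u_idx))
    # Assumption: independence is the smaller of the two accessor counts.
    return {s: min(len(left[s]), len(right[s])) for s in left}


def score(word, av, c=2.0, d=1.0, single_char_score=1.0):
    """Hypothetical polynomial target function of length and accessor count."""
    if len(word) == 1:
        return single_char_score
    return (len(word) ** c) * (av.get(word, 0) ** d)


def segment(text, av, max_len=4, **score_args):
    """Dynamic programming: pick the split that maximizes the summed score."""
    n = len(text)
    best = [0.0] + [float("-inf")] * n   # best[i]: best score for text[:i]
    back = [0] * (n + 1)                 # back[i]: start of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            candidate = best[j] + score(text[j:i], av, **score_args)
            if candidate > best[i]:
                best[i], back[i] = candidate, j
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

A call such as `segment(utterance, accessor_variety(corpus))` returns a list of word candidates whose concatenation is the original utterance. On a realistic corpus the exponents and the single-character score would need tuning, which is the kind of exploration the abstract describes.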

Towards Unified Chinese Segmentation Algorithm

Chinese Word Segmentation Method Based on Dictionary and Frequency of the Words

Unsupervised Segmentation of Chinese Corpus Using Accessor Variety

Chinese Word Segmentation Without Using Lexicon and Hand-Crafted Training Data

Survey on Chinese Word Segmentation

Algorithm for Solving 3-Character Crossing Ambiguities in Chinese Word Segmentation

Chinese Word Segmentation Evaluation Methodology Based on Web Search Engines

Incorporate Web Search Technology to Solve Out-of-Vocabulary Words in Chinese Word Segmentation

Some Problems of Chinese Segmentation

Chinese word segmentation at Peking University

Automatic Corpus Expansion for Chinese Word Segmentation by Exploiting the Redundancy of Web Information

A Compression-based Algorithm for Chinese Word Segmentation

A Pragmatic Approach for Classical Chinese Word Segmentation

Combine CRF and MMSEG to Boost Chinese Word Segmentation in Social Media

A Unicode Based Adaptive Segmentor

Chinese Word Segmentation Probability Dictionary Training and Enrich Solution

Page Segmentation of Chinese Newspapers

Hierarchically Classifying Chinese Web Documents Without Dictionary Support And Segmentation Procedure

Segmentation standard for Chinese natural language processing

TopWORDS-Seg: Simultaneous Text Segmentation and Word Discovery for Open-Domain Chinese Texts via Bayesian Inference

Domain-Aware Word Segmentation for Chinese Language: A Document-Level Context-Aware Model