Abstract: A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input, yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high F-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high F-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick training data preparation, converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and training with 1 million vectors achieve comparable and reliable results. In addition, when applied to SIGHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike in all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. In conclusion, we discuss linguistic ramifications as well as future implications of the WBD approach.
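The data preparation step described in the abstract can be illustrated with a minimal sketch. It assumes a gold-segmented sentence given as a list of words and emits one example per interval between adjacent characters, labeled 1 if the interval is a word boundary and 0 otherwise. The abstract does not specify what the contextual vectors contain, so the context window of two characters on each side, the padding symbol, and the function name `boundary_examples` are all assumptions made for illustration, not the authors' actual feature design.

```python
from typing import List, Tuple

PAD = "#"  # hypothetical padding symbol for positions outside the sentence


def boundary_examples(
    words: List[str], window: int = 2
) -> List[Tuple[Tuple[str, ...], int]]:
    """Turn a segmented sentence into (context, label) pairs.

    One pair is produced per interval between two adjacent characters.
    The label is 1 when the interval is a word boundary, else 0.
    The context is `window` characters to the left and to the right
    of the interval (an assumed feature layout).
    """
    text = "".join(words)

    # Character offsets at which a word ends, i.e. the true boundaries.
    boundaries = set()
    pos = 0
    for w in words[:-1]:
        pos += len(w)
        boundaries.add(pos)

    padded = PAD * window + text + PAD * window
    examples = []
    # Interval i lies between text[i - 1] and text[i].
    for i in range(1, len(text)):
        j = i + window  # index of text[i] inside the padded string
        context = tuple(padded[j - window : j + window])
        examples.append((context, 1 if i in boundaries else 0))
    return examples
```

Pairs produced this way could be encoded numerically and passed to any binary classifier. At prediction time, a word boundary would be inserted at every interval the classifier labels 1.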

Adapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches.

A Comparison of Chinese Word Segmentation on News and Microblog Corpora with a Lexicon Based Method.

Word Segmentation of Micro Blogs with Bagging.

Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations.

The CIPS-SIGHAN CLP 2012 Chinese Word Segmentation on Microblog Corpora Bakeoff.

Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Micro-blog Texts

CRF-based Hybrid Model for Word Segmentation, NER and Even POS Tagging

Combine CRF and MMSEG to Boost Chinese Word Segmentation in Social Media

Adaptive Chinese Word Segmentation with Online Passive-Aggressive Algorithm.

A New Error-driven Learning Approach for Chinese Word Segmentation

A Unicode Based Adaptive Segmentor

A Hybrid Approach to Chinese Word Segmentation around CRFs

Addressing Domain Adaptation for Chinese Word Segmentation with Global Recurrent Structure.

Comparison and Optimization of Microblogging Segmentation Methods.

A Method Combining Rule-based and Statistics-based Approaches for Chinese Word Segmentation

Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts

Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

A realistic and robust model for Chinese word segmentation

A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation.

A Chinese Word Segmentation for Statistical Machine Translation

Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping