Abstract: A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high F-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high F-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick preparation of training data by converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors achieves results that are comparable to, and as reliable as, training with 1 million vectors. In addition, when applied to SigHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. We conclude by discussing the linguistic ramifications and future implications of the WBD approach.
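To make the binary-decision framing concrete, the sketch below turns one segmented sentence into training examples, one per break between adjacent characters. It is only an illustration: the function name `boundary_examples`, the `window` and `pad` parameters, and the choice of representing context as the characters to the left and right of each break are assumptions made here, not the vector design used in the paper or in [15].

```python
def boundary_examples(segmented_words, window=2, pad="#"):
    """Build one (context, label) pair per break between adjacent characters.

    label is 1 when the break is a word boundary in the reference
    segmentation and 0 otherwise. The context layout is an assumption.
    """
    chars = "".join(segmented_words)

    # Character offsets at which a word ends (the sentence end is excluded).
    boundaries = set()
    pos = 0
    for word in segmented_words[:-1]:
        pos += len(word)
        boundaries.add(pos)

    # Pad both ends so every break has a full window of context.
    padded = pad * window + chars + pad * window

    examples = []
    for gap in range(1, len(chars)):
        left = padded[gap:gap + window]                 # characters before the break
        right = padded[gap + window:gap + 2 * window]   # characters after the break
        examples.append(((left, right), 1 if gap in boundaries else 0))
    return examples


# Hypothetical reference segmentation of an eight-character sentence.
examples = boundary_examples(["我们", "喜欢", "自然语言"])
labels = [label for _, label in examples]
```

Each pair could then be encoded numerically and passed to any binary classifier; segmenting new text amounts to predicting the label at every break and splitting wherever the prediction is 1.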

A Joint Model for Unsupervised Chinese Word Segmentation

A Refined HDP-Based Model for Unsupervised Chinese Word Segmentation

Unsupervised Neural Word Segmentation for Chinese Via Segmental Language Modeling

Joint n-gram Chinese language modeling with an application to Chinese word segmentation

A Local Generative Model For Chinese Word Segmentation

A Deep Convolutional Neural Model for Character-Based Chinese Word Segmentation

A Unified Model for Joint Chinese Word Segmentation and POS Tagging with Heterogeneous Annotation Corpora

A Unified Model for Joint Chinese Word Segmentation and Dependency Parsing

A realistic and robust model for Chinese word segmentation

A Unified Model for Solving the OOV Problem of Chinese Word Segmentation

Unsupervised Word Segmentation with Bi-directional Neural Language Model

Segment, Mask, and Predict: Augmenting Chinese Word Segmentation with Self-Supervision

A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information

A Comparison Study of Candidate Generation for Chinese Word Segmentation

A Comparative Study on Chinese Word Segmentation Using Statistical Models

Chinese Word Segmentation with Heterogeneous Graph Neural Network

A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing

A joint method for Chinese word segmentation and part-of-speech labeling based on deep neural network

Chinese Word Segmentation with Maximum Entropy and N-gram Language Model

Joint Chinese Word Segmentation and Span-based Constituency Parsing

Bidirectional LSTM-CRF Attention-based Model for Chinese Word Segmentation