Abstract: A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input, yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation, e.g. [1,16], achieve high F-scores yet require massive training for any variation. Text-driven approaches, e.g. [12], can be easily adapted to domain and genre changes yet have difficulty matching the high F-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in [15]. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick preparation of training data by converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and training with 1 million vectors achieve comparable and reliable results. In addition, when applied to SIGHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike in all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. In conclusion, we discuss linguistic ramifications as well as future implications of the WBD approach.
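The binary boundary decision described in the abstract can be illustrated with a minimal sketch. It assumes a simple context representation (the two characters on each side of every break, padded with `#`); the abstract does not specify how the contextual vectors are built, so the window size, the padding symbol, and the function names `boundary_examples` and `segment` are hypothetical choices made for illustration only, not the authors' implementation.

```python
from typing import List, Tuple

def boundary_examples(
    words: List[str], window: int = 2, pad: str = "#"
) -> List[Tuple[Tuple[str, ...], int]]:
    """Turn one gold-segmented sentence into (context, label) pairs.

    One pair is produced for every break between two adjacent characters.
    label = 1 if the break is a word boundary, 0 otherwise.
    The context (an assumption of this sketch) is the `window` characters
    to the left and to the right of the break.
    """
    text = "".join(words)
    # Character offsets at which a word ends in the gold segmentation.
    ends = set()
    pos = 0
    for w in words:
        pos += len(w)
        ends.add(pos)
    padded = pad * window + text + pad * window
    examples = []
    for i in range(1, len(text)):  # break between text[i-1] and text[i]
        context = tuple(padded[i : i + 2 * window])
        examples.append((context, 1 if i in ends else 0))
    return examples

def segment(text: str, decisions: List[int]) -> List[str]:
    """Rebuild words from one binary decision per internal break."""
    words, start = [], 0
    for i, is_boundary in enumerate(decisions, start=1):
        if is_boundary:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

gold = ["我们", "喜欢", "北京"]
examples = boundary_examples(gold)
labels = [label for _, label in examples]
restored = segment("".join(gold), labels)
```

In this sketch the (context, label) pairs would be the training input for any binary classifier, and `segment` applies the classifier's decisions to unsegmented text. Because each break is classified independently from local context, no lexicon lookup is involved.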

Covering ambiguity resolution in Chinese word segmentation based on contextual information

Context-based Approach to Covering Ambiguity Resolution in Chinese Word Segmentation

Resolution to Combinational Ambiguity of Chinese Word Segmentation

Chinese Word Segmentation Method Based on Dictionary and Frequency of the Words

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

Statistical Properties of Overlapping Word Segmentation Ambiguities in Domain-specific Chinese Corpora

Resolution to Chinese Combinational Ambiguity Combined Corpus-Based Method with Linguistics Knowledge

Parsing-based Chinese word segmentation integrating morphological and syntactic information

Algorithm for Solving 3-Character Crossing Ambiguities in Chinese Word Segmentation

A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information

Domain-Aware Word Segmentation for Chinese Language: A Document-Level Context-Aware Model

Shades of meaning: Uncovering the geometry of ambiguous word representations through contextualised language models

Word Sense Disambiguation Based on Positional Weighted Context

A Unified Model for Word Sense Representation and Disambiguation

Chinese WSD Based on Context Calculation Model

A Unified Model for Solving the OOV Problem of Chinese Word Segmentation

Towards Resolving Word Ambiguity with Word Embeddings

Application of the Transformer Model Algorithm in Chinese Word Sense Disambiguation: A Case Study in Chinese Language

A realistic and robust model for Chinese word segmentation

Research of Reverse Backtracking Matching Algorithm for Chinese Word Segmentation

The Resolution of English Lexical Ambiguity by Chinese EFL Learners in Sentential Context