Effective Neural Solution for Multi-criteria Word Segmentation

Han He, Lei Wu, Hua Yan, Zhimin Gao, Yi Feng, George Townsend
DOI: https://doi.org/10.1007/978-981-13-1927-3_14
2018-11-05
Abstract: We present a novel and elegant deep learning solution to train a single joint model on multi-criteria corpora for the Chinese Word Segmentation (CWS) task. Our design requires no private layers in the model architecture; instead, it introduces two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. The rest of the model, including the Long Short-Term Memory (LSTM) layer and the Conditional Random Field (CRF) layer, remains unchanged and is shared across all datasets, keeping the number of parameters minimal and constant. On the Bakeoff 2005 and Bakeoff 2008 datasets, our design surpasses previous multi-criteria learning results. Test results on two out of four datasets even surpass the latest state-of-the-art single-criterion learning scores. To the best of our knowledge, our design is the first to achieve the latest state-of-the-art performance on such large-scale datasets. Source code and corpora for this paper are available on GitHub (https://github.com/hankcs/multi-criteria-cws).
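The criterion-token idea described in the abstract can be illustrated with a short sketch. This is not the authors' code: the function names `wrap_with_criterion` and `merge_corpora`, the `<pku>` / `</pku>` token format, and the toy sentences are all assumptions made for illustration. The sketch only shows the input-side transformation; the shared LSTM-CRF model itself is not reproduced here.

```python
from typing import Dict, List


def wrap_with_criterion(chars: List[str], criterion: str) -> List[str]:
    """Surround a character sequence with two artificial tokens naming the criterion.

    The token spelling (<name> ... </name>) is a hypothetical choice.
    """
    return [f"<{criterion}>"] + chars + [f"</{criterion}>"]


def merge_corpora(corpora: Dict[str, List[str]]) -> List[List[str]]:
    """Build one joint training set from several single-criterion corpora.

    Every sentence is split into characters and tagged with its corpus name,
    so a single shared model can tell which segmentation standard applies.
    """
    merged = []
    for criterion, sentences in corpora.items():
        for sentence in sentences:
            merged.append(wrap_with_criterion(list(sentence), criterion))
    return merged


if __name__ == "__main__":
    # Toy data: the same sentence appears under two (hypothetical) criteria.
    toy = {"pku": ["我爱北京"], "msr": ["我爱北京"]}
    for example in merge_corpora(toy):
        print(example)
```

Because the criterion is carried entirely by the two extra tokens, the embedding table grows by only two entries per corpus, while every other parameter is shared across all datasets.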