Word Segmentation on Micro-blog Texts with External Lexicon and Heterogeneous Data.
Qingrong Xia, Zhenghua Li, Jiayuan Chao, Min Zhang
DOI: https://doi.org/10.1007/978-3-319-50496-4_64
2016-01-01
Abstract: This paper describes our system for the NLPCC 2016 shared task on word segmentation of micro-blog texts (i.e., Weibo). We treat word segmentation as a character-wise sequence labeling problem and explore two directions to enhance our CRF-based baseline. First, we use a large-scale external lexicon to construct extra lexicon features in the model, which proves to be extremely useful. Second, we exploit two heterogeneous datasets, i.e., Penn Chinese Treebank 7 (CTB7) and People's Daily (PD), to help word segmentation on Weibo. To do so, we adopt two mainstream approaches: the guide-feature based approach and the recently proposed coupled sequence labeling approach. We combine the above techniques in different ways and obtain four well-performing models. Finally, we merge the outputs of the four models and obtain the final results via Viterbi-based re-decoding. On the Weibo test data, our proposed approach outperforms the baseline by 1.39 percentage points in F1 score (\(95.63 - 94.24 = 1.39\)). Our final system ranks first among five participants in the open track in terms of F1 score, and is also the best among all 28 submissions. All code, experiment configurations, and the external lexicon are released at http://hlt.suda.edu.cn/~zhli.
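The character-wise formulation and the lexicon features mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the B/M/E/S tag set, the function names, the feature names (`lex_begin_len`, `lex_end_len`), and the `max_len` cutoff are all assumptions made for illustration, and the abstract does not specify the actual tag scheme or feature templates.

```python
from typing import Dict, List, Set


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one tag per character.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Recover words from a character sequence and its predicted tags."""
    words: List[str] = []
    buffer = ""
    for ch, tag in zip(chars, tags):
        buffer += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(buffer)
            buffer = ""
    if buffer:  # tolerate a tag sequence that ends mid-word
        words.append(buffer)
    return words


def lexicon_features(chars: str, i: int, lexicon: Set[str],
                     max_len: int = 4) -> Dict[str, int]:
    """Hypothetical lexicon features for the character at position i.

    Returns the length of the longest lexicon word that begins at i and the
    longest that ends at i (0 when no lexicon word matches).
    """
    begin = max((n for n in range(1, max_len + 1)
                 if i + n <= len(chars) and chars[i:i + n] in lexicon),
                default=0)
    end = max((n for n in range(1, max_len + 1)
               if i - n + 1 >= 0 and chars[i - n + 1:i + 1] in lexicon),
              default=0)
    return {"lex_begin_len": begin, "lex_end_len": end}
```

In a CRF-based segmenter, features of this kind would be attached to each character alongside the usual character n-gram features, so that a strong lexicon match can bias the tagger toward a word boundary.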