Is Character Glyph Useless? Improving Neural Chinese Word Segmentation with Character Glyph Embedding

Zexue He, Jingle Xu, Mairgup Mansur, Baobao Chang
2018-01-01
Abstract: There is rich information hidden in the glyphs of Chinese characters, which consist of many small picture-like components. However, few works are aware of the importance of overall glyph information, and some even draw negative conclusions about it. Based on the idea of utilizing overall glyph information in the Chinese word segmentation (CWS) task, we propose a model that introduces an autoencoder, applied to our synthetic Chinese Character Image Datasets, before a BiLSTM with CRF to generate character glyph embeddings. Our experimental results show that the model performs well without any external dictionaries, word features, or other resources on several standard datasets covering both Simplified Chinese and Traditional Chinese, whose glyphs are more regular because they have undergone fewer evolutionary simplifications. These results verify the usefulness of Chinese character glyphs for Chinese word segmentation, especially their strong support in handling out-of-vocabulary (OOV) words and their considerable help for Traditional Chinese word segmentation.
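To make the idea of a glyph embedding concrete, the following is a minimal sketch of an autoencoder that compresses character bitmaps into short vectors. Everything in it is an assumption for illustration: the 3x3 toy bitmaps stand in for the synthetic character images, the autoencoder is purely linear, and the names `train_autoencoder`, `encode`, and `reconstruction_error` are hypothetical. The abstract does not specify the authors' architecture, and the BiLSTM-CRF tagger that would consume these embeddings is not shown.

```python
import random


def train_autoencoder(images, code_dim, epochs=500, lr=0.02, seed=0):
    """Train a tiny linear autoencoder with per-sample gradient descent.

    images: list of flattened bitmaps (lists of 0/1 pixel values).
    Returns (W_enc, W_dec), the encoder and decoder weight matrices.
    """
    rng = random.Random(seed)
    n_pix = len(images[0])
    W_enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_pix)] for _ in range(code_dim)]
    W_dec = [[rng.uniform(-0.1, 0.1) for _ in range(code_dim)] for _ in range(n_pix)]
    for _ in range(epochs):
        for x in images:
            # Forward pass: pixels -> code h -> reconstruction y.
            h = [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]
            y = [sum(w * hj for w, hj in zip(row, h)) for row in W_dec]
            err = [yi - xi for yi, xi in zip(y, x)]
            # Gradient of 0.5 * ||y - x||^2 with respect to the code h,
            # computed before the decoder weights are updated.
            grad_h = [sum(err[i] * W_dec[i][j] for i in range(n_pix))
                      for j in range(code_dim)]
            for i in range(n_pix):
                for j in range(code_dim):
                    W_dec[i][j] -= lr * err[i] * h[j]
            for j in range(code_dim):
                for i in range(n_pix):
                    W_enc[j][i] -= lr * grad_h[j] * x[i]
    return W_enc, W_dec


def encode(W_enc, x):
    """Map a flattened bitmap to its glyph embedding (the bottleneck code)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_enc]


def reconstruction_error(W_enc, W_dec, x):
    """Squared error between a bitmap and its reconstruction."""
    h = encode(W_enc, x)
    y = [sum(w * hj for w, hj in zip(row, h)) for row in W_dec]
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x))


# Toy 3x3 bitmaps (hypothetical stand-ins for rendered character images).
toy_glyphs = {
    "cross": [0, 1, 0, 1, 1, 1, 0, 1, 0],
    "box":   [1, 1, 1, 1, 0, 1, 1, 1, 1],
    "bar":   [0, 0, 0, 1, 1, 1, 0, 0, 0],
}

W_enc, W_dec = train_autoencoder(list(toy_glyphs.values()), code_dim=2)
glyph_embeddings = {name: encode(W_enc, x) for name, x in toy_glyphs.items()}
```

In a pipeline of the kind the abstract describes, each character's embedding from `glyph_embeddings` would serve as input to the sequence tagger. Because the embedding is computed from the image alone, a character never seen in the segmentation training data still receives a representation, which is one plausible reading of why glyph information helps with OOV words.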