Abstract: Words provide a useful source of information for Chinese NLP, and word segmentation has been taken as a pre-processing step for most downstream tasks. For many NLP tasks, however, word segmentation can introduce noise and lead to error propagation. The rise of neural representation learning models allows sentence-level semantic information to be collected from characters directly. As a result, it is an empirical question whether a fully character-based model should be used instead of first performing word segmentation. We investigate a neural representation that simultaneously encodes character and word information without the need for segmentation. In particular, candidate words are found in a sentence by matching against a pre-defined lexicon. A lattice-structured LSTM is used to encode the resulting word-character lattice, where gate vectors control information flow through words, so that the more useful words can be automatically identified by end-to-end training. We compare the performance of the resulting lattice LSTM with baseline sequence LSTM structures over both character sequences and automatically segmented word sequences. Results on NER show that the character-word lattice model can significantly improve performance. In addition, as a general sentence representation architecture, the character-word lattice LSTM can also be used for learning contextualized representations. To this end, we compare the lattice LSTM structure with its sequential LSTM counterpart, namely ELMo. Results show that our lattice version of ELMo gives better language modeling performance. On Chinese POS tagging, chunking and syntactic parsing tasks, the resulting contextualized Chinese embeddings also give better performance than ELMo trained on the same data.
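The lexicon-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `match_lexicon_words`, the `max_word_len` cutoff, and the toy sentence and lexicon are all assumptions made for illustration. The sketch enumerates every multi-character substring found in the lexicon, so overlapping and conflicting candidates all enter the lattice rather than being resolved by a segmenter.

```python
from typing import List, Set, Tuple


def match_lexicon_words(
    sentence: str, lexicon: Set[str], max_word_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) spans for lexicon words found in `sentence`.

    `end` is exclusive. Only multi-character words are returned, since the
    single characters already form the backbone of the lattice.
    `max_word_len` is a hypothetical cutoff chosen for this sketch.
    """
    spans = []
    n = len(sentence)
    for start in range(n):
        # Try every candidate length from 2 up to max_word_len characters.
        for end in range(start + 2, min(n, start + max_word_len) + 1):
            word = sentence[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


# Toy example (illustrative data, not taken from the paper's experiments).
toy_lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
for start, end, word in match_lexicon_words("南京市长江大桥", toy_lexicon):
    print(start, end, word)
```

Note that both "南京市" and "市长" are returned even though they compete for the character "市". In the approach the abstract describes, choosing among such candidates is left to the gate vectors learned during end-to-end training rather than to a segmentation step.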

Woodblock-Printing Mongolian Words Recognition by Bi-LSTM with Attention Mechanism

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Multi-font Printed Mongolian Document Recognition System

Bi-directional LSTM Recurrent Neural Network for Chinese Word Segmentation

State-of-the-art Chinese Word Segmentation with Bi-LSTMs

Bidirectional LSTM-CRF Attention-based Model for Chinese Word Segmentation

Chinese Image Text Recognition with BLSTM-CTC: A Segmentation-Free Method

Segmentation and Recognition for Historical Tibetan Document Images

A Sequence Labeling Based Approach for Character Segmentation of Historical Documents

Lattice LSTM for Chinese Sentence Representation

Research on the LSTM Mongolian and Chinese machine translation based on morpheme encoding

A Multi-Scale Hybrid Attention Network for Sentence Segmentation Line Detection in Dongba Scripture

Long Short-Term Memory Neural Networks for Chinese Word Segmentation

DAG-based Long Short-Term Memory for Neural Word Segmentation

Neural Chinese Word Segmentation as Sequence to Sequence Translation

A Deep Investigation of RNN and Self-attention for the Cyrillic-Traditional Mongolian Bidirectional Conversion

Refocus attention span networks for handwriting line recognition

Fast Recurrent Neural Network with Bi-LSTM for Handwritten Tamil text segmentation in NLP

Neural Word Segmentation Learning for Chinese

An Efficient End-to-End Neural Model for Handwritten Text Recognition