Abstract: Words provide a useful source of information for Chinese NLP, and word segmentation has been taken as a pre-processing step for most downstream tasks. For many NLP tasks, however, word segmentation can introduce noise and lead to error propagation. The rise of neural representation learning models allows sentence-level semantic information to be collected from characters directly. As a result, it is an empirical question whether a fully character-based model should be used instead of first performing word segmentation. We investigate a neural representation that simultaneously encodes character and word information without the need for segmentation. In particular, candidate words are found in a sentence by matching against a pre-defined lexicon. A lattice-structured LSTM is used to encode the resulting word-character lattice, where gate vectors control the information flow through words, so that the more useful words can be identified automatically by end-to-end training. We compare the performance of the resulting lattice LSTM and baseline sequence LSTM structures over both character sequences and automatically segmented word sequences. Results on NER show that the character-word lattice model significantly improves performance. In addition, as a general sentence representation architecture, the character-word lattice LSTM can also be used for learning contextualized representations. To this end, we compare the lattice LSTM structure with its sequential LSTM counterpart, namely ELMo. Results show that our lattice version of ELMo gives better language modeling performance. On Chinese POS tagging, chunking and syntactic parsing tasks, the resulting contextualized Chinese embeddings also give better performance than ELMo trained on the same data.
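
The lexicon-matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `match_lexicon`, the `max_len` cutoff, the restriction to words of two or more characters, and the toy lexicon are all assumptions made for illustration. The sketch only enumerates the candidate word spans that would form the extra edges of a word-character lattice; the gated lattice LSTM that consumes them is not shown.

```python
from typing import List, Set, Tuple


def match_lexicon(
    sentence: str, lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every lexicon word found in the sentence.

    `end` is exclusive. Only words of two or more characters are returned,
    since single characters are already nodes of the character sequence.
    `max_len` is an assumed cap on candidate word length.
    """
    spans = []
    n = len(sentence)
    for start in range(n):
        # Try every substring starting at `start`, from 2 up to max_len characters.
        for end in range(start + 2, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


# Hypothetical toy lexicon; a real system would load a large word list.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
sentence = "南京市长江大桥"

lattice_edges = match_lexicon(sentence, lexicon)
for start, end, word in lattice_edges:
    print(start, end, word)
```

Note that overlapping and mutually inconsistent candidates (for example "市长" and "长江") are all kept. No segmentation decision is made at this stage, which matches the abstract's point that the model, rather than a segmenter, learns which words are useful.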

Understanding Subtitles by Character-Level Sequence-to-Sequence Learning

Residual Recurrent Neural Networks for Learning Sequential Representations

Lattice LSTM for Chinese Sentence Representation

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation

Sequence to Sequence Learning with Neural Networks

Character n-gram Embeddings to Improve RNN Language Models

End-to-End Subtitle Detection and Recognition for Videos in East Asian Languages via CNN Ensemble with Near-Human-Level Performance

A Sequential Neural Encoder with Latent Structured Description for Modeling Sentences

Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation

Chinese Syllable-to-character Conversion with Recurrent Neural Network Based Supervised Sequence Labelling

A Character-Aware Encoder for Neural Machine Translation

Effective Subword Segmentation for Text Comprehension

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

Character-level Chinese-English Translation through ASCII Encoding

Character-based Neural Machine Translation

Jointly Modeling Embedding and Translation to Bridge Video and Language

End-to-End Text Classification via Image-based Embedding using Character-level Networks

Bi-directional LSTM Recurrent Neural Network for Chinese Word Segmentation

Subword Encoding in Lattice LSTM for Chinese Word Segmentation

Empower Sequence Labeling with Task-Aware Neural Language Model