Abstract: Words provide a useful source of information for Chinese NLP, and word segmentation has been taken as a pre-processing step for most downstream tasks. For many NLP tasks, however, word segmentation can introduce noise and lead to error propagation. The rise of neural representation learning models allows sentence-level semantic information to be collected from characters directly. As a result, it is an empirical question whether a fully character-based model should be used instead of first performing word segmentation. We investigate a neural representation that simultaneously encodes character and word information without the need for segmentation. In particular, candidate words are found in a sentence by matching against a pre-defined lexicon. A lattice-structured LSTM is used to encode the resulting word-character lattice, where gate vectors control information flow through words, so that the more useful words can be identified automatically by end-to-end training. We compare the performance of the resulting lattice LSTM with baseline sequence LSTM structures over both character sequences and automatically segmented word sequences. Results on NER show that the character-word lattice model significantly improves performance. In addition, as a general sentence representation architecture, the character-word lattice LSTM can also be used for learning contextualized representations. To this end, we compare the lattice LSTM structure with its sequential LSTM counterpart, namely ELMo. Results show that our lattice version of ELMo gives better language modeling performance. On Chinese POS tagging, chunking, and syntactic parsing tasks, the resulting contextualized Chinese embeddings also give better performance than ELMo trained on the same data.
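
The lexicon-matching step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `match_lexicon`, the `max_len` cutoff, the toy lexicon, and the choice to skip single-character words are all assumptions made for illustration. It simply enumerates every multi-character substring of the sentence and keeps those found in the lexicon; the resulting spans are the word edges that a word-character lattice would add on top of the character sequence.

```python
def match_lexicon(sentence, lexicon, max_len=4):
    """Return (start, end, word) for every lexicon word found in `sentence`.

    `end` is exclusive. Single characters are skipped (an assumption):
    they are already covered by the character sequence itself.
    `max_len` is a hypothetical cap on candidate word length.
    """
    spans = []
    n = len(sentence)
    for start in range(n):
        # Try every substring of length 2..max_len beginning at `start`.
        for end in range(start + 2, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


# Toy lexicon, assumed for illustration only.
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
spans = match_lexicon("南京市长江大桥", lexicon)
for start, end, word in spans:
    print(start, end, word)
```

Note that overlapping and mutually inconsistent candidates (for example "市长" and "长江") are all kept. No segmentation decision is made at this stage; per the abstract, the gates of the lattice LSTM are what learn which candidate words are useful.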

Learning Chinese-Japanese Bilingual Word Embedding by Using Common Characters

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings Via Matrix Factorization

Joint Learning of Character and Word Embeddings

Hierarchical Joint Learning for Chinese Word Embeddings

VCWE: Visual Character-Enhanced Word Embeddings

Combination Methods of Chinese Character and Word Embeddings in Deep Learning

Multiple Character Embeddings for Chinese Word Segmentation

Component-Enhanced Chinese Character Embeddings

Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters

Learning Sense-specific Word Embeddings By Exploiting Bilingual Resources

Jointly Learning Bilingual Word Embeddings and Alignments

Glyph-aware Embedding of Chinese Characters

Learning Chinese word representation better by cascade morphological n-gram

Pronunciation-Enhanced Chinese Word Embedding

Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

End-to-End Text Classification via Image-based Embedding using Character-level Networks

Improved Learning of Chinese Word Embeddings with Semantic Knowledge

Lattice LSTM for Chinese Sentence Representation

cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

A Novel Bilingual Word Embedding Method for Lexical Translation Using Bilingual Sense Clique