Abstract: New words are particularly problematic in Chinese natural language processing. With the rapid development of the Internet and the resulting information explosion, it is impossible to build a complete system lexicon for Chinese natural language processing applications, because new words outside existing dictionaries are constantly being created. New word identification and POS tagging are usually performed as separate steps, so lexical features cannot be fully exploited. A latent discriminative model, which combines the strengths of the Latent Dynamic Conditional Random Field (LDCRF) and the semi-CRF, is proposed to detect new words and their POS tags simultaneously, regardless of the type of new word, from Chinese text that has not been pre-segmented. Unlike a standard semi-CRF, the proposed model uses an LDCRF to generate candidate entities, which speeds up training and reduces computational cost. The complexity of the proposed hidden semi-CRF can be further adjusted by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those in real text. A global feature, called “Global Fragment Features”, is adopted for new word identification. We tested our model on the SIGHAN-6 corpus. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags with satisfactory results, and that it performs competitively with state-of-the-art models.
