Abstract: Bilingual Named Entity (NE) pairs are valuable resources for many NLP applications. Because comparable corpora are more accessible, abundant and up-to-date than parallel corpora, recent research has concentrated on mining bilingual lexicons from them. This research presents a novel approach to mining English-Chinese NE translations from comparable corpora. For every possible NE pair, it combines multi-dimensional features drawn from various information sources: the transliteration model, English-Chinese matching, Chinese-English matching, the translation model, length, and the context vector. These features are integrated into one model by linear combination, using the minimum sample risk (MSR) algorithm. Because NE translation is highly type-dependent, we integrate different features for different NE types. We experiment with the individual features and with integrated features to mine person NE (PN) pairs, location NE (LN) pairs and organization NE (ON) pairs. For PN pairs, the best performance, an F-score of 84.9%, is achieved with the transliteration and length features. For LN pairs, the best performance, an F-score of 83.4%, is achieved with the transliteration model, length, translation model, English-Chinese matching and Chinese-English matching features. For ON pairs, the best performance, an F-score of 84.1%, is achieved with the English-Chinese matching and Chinese-English matching features.
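The type-dependent linear combination described in the abstract can be illustrated with a minimal sketch. The feature subsets per NE type follow the abstract; everything else is an assumption made for illustration: the feature key names, the function names `combined_score` and `best_candidate`, and the weight and feature values. In particular, the weights here are supplied by hand, whereas the abstract states that the features are integrated using the MSR algorithm, which is not implemented here.

```python
from typing import Dict

# Feature subsets per NE type, as reported in the abstract.
# The key names themselves are hypothetical labels.
FEATURES_BY_TYPE: Dict[str, list] = {
    "PN": ["transliteration", "length"],
    "LN": ["transliteration", "length", "translation",
           "en_zh_matching", "zh_en_matching"],
    "ON": ["en_zh_matching", "zh_en_matching"],
}


def combined_score(ne_type: str,
                   features: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of the feature scores selected for this NE type."""
    return sum(weights[name] * features[name]
               for name in FEATURES_BY_TYPE[ne_type])


def best_candidate(ne_type: str,
                   candidates: Dict[str, Dict[str, float]],
                   weights: Dict[str, float]) -> str:
    """Return the candidate translation with the highest combined score."""
    return max(candidates,
               key=lambda c: combined_score(ne_type, candidates[c], weights))


# Hypothetical weights and feature scores, for illustration only.
weights = {"transliteration": 0.7, "length": 0.3}
candidates = {
    "candidate_a": {"transliteration": 0.9, "length": 0.8},
    "candidate_b": {"transliteration": 0.4, "length": 0.9},
}
chosen = best_candidate("PN", candidates, weights)
```

With these illustrative numbers, `candidate_a` is selected for the person-name case because its transliteration score dominates under the chosen weights. Keeping the feature subsets in a table keyed by NE type makes it easy to try other combinations of individual and integrated features.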

Mining English-Chinese Named Entity Pairs from Comparable Corpora.