Mining Large-scale Parallel Corpora from Multilingual Patents: an English-Chinese Example and Its Application to SMT

Bin Lu, Benjamin Ka-Yin T'sou, Tao Jiang, Oi Yee Kwong, Jingbo Zhu
2010-01-01
Abstract: In this paper, we demonstrate how to mine large-scale parallel corpora from multilingual patents, a resource that has not been thoroughly explored before. We show how a large-scale English-Chinese parallel corpus containing over 14 million sentence pairs, of which only 1-5% are wrong, can be mined from a large collection of English-Chinese bilingual patents. To our knowledge, this is the largest single parallel corpus in terms of sentence pairs. Moreover, we estimate the potential for mining multilingual parallel corpora involving English, Chinese, Japanese, Korean, German, and other languages, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing.