Building a Large English-Chinese Parallel Corpus from Comparable Patents and Its Experimental Application to SMT

Bin Lu, Tao Jiang, Kapo Chow, Benjamin K. Tsou
2010-01-01
Abstract: This paper describes the augmentation of a Chinese-English patent parallel corpus of about 160K sentence pairs, which has been enlarged about 45 times to more than 7 million sentence pairs, mostly by “harvesting” comparable patents from the Web. First, from a large corpus of English-Chinese comparable patents, more than 22 million bilingual sentence pair candidates were mined, from which we extracted more than 7 million high-quality parallel sentences; to the best of our knowledge, this is the largest parallel sentence corpus in the patent domain. We also report some preliminary SMT results based on 1 million parallel sentences extracted from the abstract and claims sections. Last but not least, the method and approach proposed here should be applicable to other languages, suggesting a novel way to reduce the data acquisition bottleneck in multilingual language processing.