The Construction of a Chinese-English Patent Parallel Corpus

Bin Lu, Benjamin K. Tsou, Jingbo Zhu, Tao Jiang, Oi Yee Kwong
2009-01-01
Abstract: We describe the construction of a Chinese-English parallel corpus of patent sentences, built from noisy parallel patents. First, we use a publicly available sentence aligner to find parallel sentence candidates in the noisy parallel data. We then compare and evaluate three individual measures and several ensemble techniques for scoring the candidates, sort the candidates by confidence score, and filter out those with low scores as noise. The experiments show that the combination of measures outperforms the individual measures, and that filtering out low-quality sentence pairs is justified because it can improve statistical machine translation (SMT) performance. The final corpus consists of 160K sentence pairs, of which about 90% are correct or partially correct alignments.
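
The score-sort-filter step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the three measures or the ensemble techniques, so everything here is an assumption made for illustration: the `Candidate` type, the per-measure `scores` field, min-max normalization, the unweighted average used as the ensemble, and the `keep_ratio` cutoff are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One aligned sentence pair plus its per-measure scores (hypothetical layout)."""
    zh: str
    en: str
    scores: Sequence[float]  # one value per individual measure; higher = more confident


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1] so measures on different scales can be averaged."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_and_filter(candidates: List[Candidate], keep_ratio: float) -> List[Candidate]:
    """Sort candidates by a combined confidence score and keep the top fraction.

    The ensemble here is a plain average of normalized measures, which is an
    assumption for illustration, not the method used in the paper.
    """
    if not candidates:
        return []
    n_measures = len(candidates[0].scores)
    # Normalize each measure across all candidates.
    columns = [
        min_max([c.scores[m] for c in candidates]) for m in range(n_measures)
    ]
    # Combined confidence: unweighted mean of the normalized measures.
    combined = [
        sum(col[i] for col in columns) / n_measures
        for i in range(len(candidates))
    ]
    # Highest confidence first; low-scoring pairs fall off the end.
    order = sorted(range(len(candidates)), key=lambda i: combined[i], reverse=True)
    n_keep = int(len(candidates) * keep_ratio)
    return [candidates[i] for i in order[:n_keep]]
```

In practice the cutoff would be chosen empirically, for example by checking how downstream SMT quality changes as more low-confidence pairs are discarded, which is the kind of comparison the abstract reports.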