iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation

Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing
DOI: https://doi.org/10.1007/978-3-642-41491-6_26
2013-01-01
Abstract: Data selection is an important technique for improving data-driven models, especially in large-scale natural language processing (NLP). Recent research on domain adaptation for statistical machine translation (SMT) has focused on individual data selection models. In this paper, we propose a hybrid data selection model named iCPE, which combines three state-of-the-art similarity metrics (cosine tf-idf, perplexity, and edit distance) at both the corpus level and the model level. We conduct experiments on the Hong Kong Law Chinese-English corpus, and the results show that this simple and effective hybrid model outperforms both the baseline system trained on the entire data and the best rival method. The consistent performance gains of the proposed approach have profound implications for mining very large corpora in a computationally limited environment.
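To make the idea of combining the three metrics concrete, the following is a minimal sketch of one possible corpus-level combination: each general-domain sentence is scored against a small in-domain corpus by cosine tf-idf, perplexity, and edit distance, and the top-k sentences under each metric are merged. The abstract does not specify how iCPE actually combines the metrics, so everything here is an assumption made for illustration: the function names (`select_hybrid`, `edit_similarity`, and so on), the add-one-smoothed unigram language model standing in for a real n-gram model, the smoothed idf formula, whitespace tokenization, and the union-of-top-k rule. The model-level combination mentioned in the abstract is not shown.

```python
import math
from collections import Counter


def tfidf(tokens, idf):
    """Sparse tf-idf vector (term -> weight) for one token list."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def perplexity(tokens, counts, total, vocab_size):
    """Perplexity under an add-one-smoothed unigram model (lower means closer)."""
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / max(len(tokens), 1))


def edit_similarity(a, b):
    """1 minus the token-level Levenshtein distance, normalized by the longer length."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)


def select_hybrid(in_domain, general, k):
    """Union of the top-k general-domain sentences under each metric (assumed rule)."""
    in_tok = [s.split() for s in in_domain]
    gen_tok = [s.split() for s in general]
    docs = in_tok + gen_tok
    df = Counter(t for d in docs for t in set(d))
    # Smoothed idf, chosen for this sketch so that no term gets zero weight.
    idf = {t: math.log((1 + len(docs)) / (1 + n)) + 1.0 for t, n in df.items()}
    # The whole in-domain corpus is treated as one pseudo-document.
    centroid = tfidf([t for d in in_tok for t in d], idf)
    counts = Counter(t for d in in_tok for t in d)
    total, vocab_size = sum(counts.values()), len(counts) + 1  # +1 for unseen words

    # Higher score means "more in-domain" for all three, so perplexity is negated.
    scores = {
        "cosine": [cosine(tfidf(g, idf), centroid) for g in gen_tok],
        "perplexity": [-perplexity(g, counts, total, vocab_size) for g in gen_tok],
        "edit": [max(edit_similarity(g, d) for d in in_tok) for g in gen_tok],
    }
    selected = set()
    for values in scores.values():
        ranked = sorted(range(len(general)), key=lambda i: values[i], reverse=True)
        selected.update(ranked[:k])
    return [general[i] for i in sorted(selected)]
```

In this sketch, `select_hybrid(in_domain, general, k)` returns the selected sentences in their original order. Taking the union is only one way to merge the per-metric selections; intersecting them or interpolating the normalized scores would be equally plausible readings of a corpus-level combination.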