Data Selection Via Semi-supervised Recursive Autoencoders for SMT Domain Adaptation

Yi Lu, Derek F. Wong, Lidia S. Chao, Longyue Wang
DOI: https://doi.org/10.1007/978-3-662-45701-6_2
2014-01-01
Abstract: In this paper, we present a novel data selection approach based on semi-supervised recursive autoencoders. The model is trained to capture domain-specific features and is used to detect sentences relevant to a specific domain in a large general-domain corpus. The selected data are used to adapt the language model and translation model to the target domain. Experiments were conducted on an in-domain corpus (IWSLT2014 Chinese-English TED Talk) and a general-domain corpus (UM-Corpus). We evaluated the proposed data selection model both intrinsically and extrinsically, measuring the selection success rate (F-score) on pseudo data as well as the translation quality (BLEU score) of the adapted SMT systems. Empirical results show that the proposed approach outperforms the state-of-the-art selection approach.
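The intrinsic evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "pseudo data" means a pool in which the truly in-domain sentences are known in advance, so that a selector's output can be scored by precision, recall, and F-score. The function name `selection_f_score` and the toy sentence ids are hypothetical and are not taken from the paper, whose exact evaluation protocol is not given here.

```python
def selection_f_score(selected_ids, in_domain_ids):
    """Score a data selector against known in-domain sentence ids.

    Assumption: both arguments are iterables of sentence identifiers
    drawn from the same mixed (pseudo) corpus.
    """
    selected = set(selected_ids)
    relevant = set(in_domain_ids)
    true_positives = len(selected & relevant)
    if true_positives == 0:
        # No correct selections: all three scores are zero by convention.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(relevant)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: the selector picked sentences 0, 1, 2, 5,
# while sentences 0, 1, 2, 3 are the ones known to be in-domain.
precision, recall, f_score = selection_f_score(
    selected_ids=[0, 1, 2, 5],
    in_domain_ids=[0, 1, 2, 3],
)
print(precision, recall, f_score)
```

In this toy case three of the four selected sentences are in-domain and three of the four in-domain sentences are recovered, so precision, recall, and F-score all equal 0.75.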