The ICT, CAS MT Systems for the IWSLT 09 Evaluation

Haitao Mi, Yang Liu, Tian Xia, Yang Feng, Xinyan Xiao, Jun Xie, Zhaopeng Tu, Hao Xiong, Daqi Zheng, Yajuan Lü, Qun Liu
2009-01-01
Abstract: We used only the data provided by the organizers for each task. We first used the Chinese lexical analysis system ICTCLAS to segment Chinese text into words and a rule-based tokenizer to tokenize English sentences. Then, we converted all alphanumeric characters to their 2-byte representation. Finally, we ran GIZA++ and used the "grow-diag-final" heuristic to obtain many-to-many word alignments. We used the SRI Language Modeling Toolkit to train 5-gram Chinese and English language models with Kneser-Ney smoothing on the Chinese and English sides of the training corpus, respectively. For Silenus, we used the Chinese parser of Xiong et al. (2006) and the English parser of Charniak et al. (2005) to parse the source and target sides of the bilingual corpus into packed forests, respectively. We then pruned the forests with the marginal-probability-based inside-outside algorithm, using a pruning threshold p_e = 3. At decoding time, we used a larger pruning threshold, p_d = 12, to generate the packed forest.
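The alphanumeric normalization step can be illustrated with a minimal sketch. It assumes that "2-byte representation" refers to the Unicode full-width forms (U+FF10 to U+FF5A for digits and Latin letters), which occupy two bytes in GB-family encodings; the abstract does not say how the authors implemented this step, and the function name `to_fullwidth_alnum` is hypothetical.

```python
# Assumption: "2-byte representation" means Unicode full-width forms.
# Full-width characters sit at a fixed offset from their ASCII counterparts.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth_alnum(text: str) -> str:
    """Map ASCII letters and digits to full-width forms; leave the rest unchanged."""
    converted = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            # e.g. 'A' (U+0041) becomes U+FF21, '1' (U+0031) becomes U+FF11
            converted.append(chr(ord(ch) + FULLWIDTH_OFFSET))
        else:
            # Spaces, punctuation, and Chinese characters pass through as-is
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    print(to_fullwidth_alnum("IWSLT 09"))
```

Normalizing in this way keeps a token such as a room number or flight code in one consistent form on the Chinese side of the corpus, so the same token written in half-width and full-width forms is not treated as two different words during alignment.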