A Simple and Effective Approach to Coverage-Aware Neural Machine Translation Supplementary Material

Yanyang Li, Tong Xiao, Yinqiao Li, Qiang Wang, Changming Xu, Xueqiang Lu
2018-01-01
Abstract: To combine length normalization (LN) and the coverage score (CS), we use Eq. (1) at each time step. The first term of Eq. (1) is the standard log-likelihood normalized by LN. The second term is CS divided by the source sentence length |x|. This division is a normalization that keeps CS on a scale similar to that of the normalized log-likelihood: the normalized log-likelihood may no longer decline as decoding proceeds, whereas the raw coverage score would keep increasing and hurt performance. Because CS is a sum of log scores along the x-axis (the source positions), it is divided by the source sentence length |x| rather than the target sentence length |y|. Finally, we linearly interpolate the two scores to compare hypotheses during beam search.
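The combination described above can be illustrated with a minimal Python sketch. Eq. (1) is not reproduced on this page, so the exact form used here is an assumption: the interpolation weight `alpha`, the truncation floor `beta` inside the coverage score, their default values, and the function names are all hypothetical, and the LN value is taken as a precomputed number rather than derived from any particular length-normalization formula.

```python
import math
from typing import Sequence


def coverage_score(attention: Sequence[Sequence[float]], beta: float = 0.4) -> float:
    """Sum of log scores over source positions (the "x-axis").

    attention[j][i] is the attention weight that target step j puts on
    source position i. The floor `beta` (an assumed detail) keeps the
    logarithm finite for source words that received no attention yet.
    """
    src_len = len(attention[0])  # assumes at least one decoded step
    total = 0.0
    for i in range(src_len):
        received = sum(step[i] for step in attention)
        total += math.log(max(received, beta))
    return total


def combined_score(
    log_prob: float,
    length_norm: float,
    attention: Sequence[Sequence[float]],
    alpha: float = 0.5,
    beta: float = 0.4,
) -> float:
    """Linear interpolation of the two normalized scores (assumed form).

    First term: log-likelihood divided by the LN value.
    Second term: coverage score divided by the source length |x|.
    """
    src_len = len(attention[0])
    normalized_ll = log_prob / length_norm
    normalized_cs = coverage_score(attention, beta) / src_len
    return (1.0 - alpha) * normalized_ll + alpha * normalized_cs
```

In beam search, a function like `combined_score` would be evaluated for every partial hypothesis at each time step, and the hypotheses with the highest values would be kept. Dividing by the source length rather than the target length mirrors the reasoning in the abstract: the coverage score has one log term per source word, regardless of how many target words have been produced.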