Evaluating Rewards for Question Generation Models

Tom Hosking, Sebastian Riedel
DOI: https://doi.org/10.48550/arXiv.1902.11049
2019-06-01
Abstract: Recent approaches to question generation have used modifications to a Seq2Seq architecture inspired by advances in machine translation. Models are trained using teacher forcing to optimise only the one-step-ahead prediction. However, at test time, the model is asked to generate a whole sequence, causing errors to propagate through the generation process (exposure bias). A number of authors have proposed countering this bias by optimising for a reward that is less tightly coupled to the training data, using reinforcement learning. We optimise directly for quality metrics, including a novel approach using a discriminator learned directly from the training data. We confirm that policy gradient methods can be used to decouple training from the ground truth, leading to increases in the metrics used as rewards. We perform a human evaluation, and show that although these metrics have previously been assumed to be good proxies for question quality, they are poorly aligned with human judgement and the model simply learns to exploit the weaknesses of the reward source.
Computation and Language
What problem does this paper attempt to address?
This paper asks how a model that generates natural-language questions should be optimised so that the questions it produces are of higher quality. Existing models are trained with teacher forcing, which optimises only the word-by-word (one-step-ahead) prediction, yet at test time they must generate an entire sequence. This mismatch causes exposure bias: errors made early in generation accumulate through the rest of the sequence. Training is also tied closely to reproducing the ground truth, which limits the model's ability to explore a broader space of possible questions. To address these problems, the authors use policy gradient methods to optimise directly for several reward functions, including an adversarial discriminator trained to tell generated questions apart from real examples. The aim is for the model to recover better from non-optimal predictions and to generate higher-quality questions.

The study finds, however, that although these strategies improve the automatic metrics used as rewards (such as BLEU scores and language model scores), human evaluation rates the questions from the optimised models as worse than those from the unoptimised model. This indicates that the automatic metrics currently in use may not be good proxies for question quality, and that the model can exploit the weaknesses of a metric to obtain a high score while producing questions that humans judge to be poor.
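The contrast between teacher forcing and reward-based training described above can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy probability tables, and the constant `baseline` are hypothetical, and `reward` stands in for any scalar score (for example a BLEU score or a discriminator output) assigned to a sampled question.

```python
import math

def teacher_forcing_loss(step_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth tokens.

    step_probs[t][k] is the model's probability of token k at step t,
    computed with the ground-truth prefix as input (teacher forcing).
    """
    nll = [-math.log(step_probs[t][tok]) for t, tok in enumerate(target_ids)]
    return sum(nll) / len(nll)

def reinforce_loss(step_probs, sampled_ids, reward, baseline=0.0):
    """REINFORCE-style surrogate loss for one sampled sequence.

    step_probs[t][k] is the probability of token k at step t, computed
    with the model's own sampled prefix as input. Minimising this loss
    raises the likelihood of samples whose reward exceeds the baseline.
    The baseline is an assumption added here to reduce variance.
    """
    log_prob = sum(math.log(step_probs[t][tok]) for t, tok in enumerate(sampled_ids))
    return -(reward - baseline) * log_prob

# Toy two-step, two-token example (hypothetical numbers).
probs = [[0.5, 0.5], [0.25, 0.75]]
tf = teacher_forcing_loss(probs, target_ids=[0, 1])
pg = reinforce_loss(probs, sampled_ids=[1, 0], reward=1.0, baseline=0.5)
```

Note that the second loss never looks at the ground-truth tokens: the only training signal is the scalar reward. That is what decouples training from the ground truth, and it is also why a weak reward source can be exploited.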