Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference

William Thorne, Ambrose Robinson, Bohua Peng, Chenghua Lin, Diana Maynard
2024-10-11
Abstract: As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it's equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method's effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of machine reading comprehension (MRC) datasets for the cultural heritage domain and the high cost of creating them manually. Specifically:

1. **Lack of MRC Datasets in the Cultural Heritage Field**: Existing MRC datasets mainly cover general domains and lack the specificity and complexity needed for cultural heritage information. As a result, they cannot fully evaluate or improve search and dialogue systems in the cultural heritage field.
2. **Cost of Manual Dataset Creation**: Manually creating high-quality MRC datasets is prohibitively expensive for most cultural heritage institutions. For example, creating the SQuAD dataset cost approximately $12,000 for question writing alone, and the actual cost may have been higher.

To address these problems, the paper applies reinforcement learning on synthetic preference data to generate domain-specific MRC datasets with more difficult questions. Difficulty is estimated from how existing question-answering models perform on a subset of SQuAD, on the assumption that harder questions are answered correctly less often. The approach lowers the cost of dataset creation while raising question difficulty, which supports better evaluation of systems in the cultural heritage field.

### Main Contributions

1. **Methodology**: A method that uses PPO (Proximal Policy Optimization) and synthetic data to increase the difficulty of automatically generated questions.
2. **Empirical Evidence**: Evidence of the method's effectiveness, including human evaluation results.
3. **Error Analysis**: An in-depth error analysis and a study of emergent phenomena observed with the method.
4. **Open-source Code**: An open-source codebase and three llama-2-chat adapters for reproducing and adapting the results.

Through these contributions, the paper aims to give practitioners in the cultural heritage field an efficient and cost-effective way to generate challenging evaluation datasets.
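The difficulty metric described in the abstract (harder questions are answered correctly less often by existing question-answering models) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names `difficulty` and `preference_pairs`, the `margin` parameter, the rule for pairing questions on the same passage, and the example questions are all hypothetical.

```python
from itertools import combinations


def difficulty(correct_flags):
    """Fraction of QA models that answered incorrectly (0.0 = easy, 1.0 = hard).

    correct_flags: one boolean per QA model, True if that model answered correctly.
    """
    if not correct_flags:
        raise ValueError("need at least one model result")
    return 1.0 - sum(correct_flags) / len(correct_flags)


def preference_pairs(questions, margin=0.0):
    """Build (chosen, rejected) pairs for questions written on the same passage.

    questions: list of (question_text, correct_flags) tuples.
    The harder question is 'chosen'; pairs whose difficulty gap is not
    greater than `margin` are skipped. This pairing rule is hypothetical.
    """
    scored = [(text, difficulty(flags)) for text, flags in questions]
    pairs = []
    for (qa, da), (qb, db) in combinations(scored, 2):
        if abs(da - db) <= margin:
            continue  # too close in difficulty to express a preference
        pairs.append((qa, qb) if da > db else (qb, qa))
    return pairs


# Made-up results from four QA models on three questions about one passage.
example = [
    ("Who painted the portrait?", [True, True, True, True]),
    ("Why was the portrait moved in 1921?", [True, False, False, False]),
    ("When was it acquired?", [True, True, False, True]),
]
pairs = preference_pairs(example)
```

Pairs of this kind could then serve as synthetic preference data for training a reward model ahead of PPO fine-tuning. Raising `margin` keeps only pairs with a clear difficulty gap, which trades dataset size for a cleaner preference signal.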