Abstract: As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it's equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method's effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
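
The difficulty metric described in the abstract can be illustrated with a minimal sketch. It assumes difficulty is simply the fraction of question-answering models that fail to produce a gold answer under normalised exact match. The function names, the normalisation, and the example predictions are hypothetical and are not taken from the paper's codebase.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def difficulty(gold_answers: List[str], predictions: Dict[str, str]) -> float:
    """Fraction of models whose prediction matches none of the gold answers.

    Assumption: a question is 'harder' when fewer models answer it correctly.
    0.0 means every model was right, 1.0 means every model was wrong.
    """
    if not predictions:
        raise ValueError("need at least one model prediction")
    gold = {normalize(answer) for answer in gold_answers}
    wrong = sum(1 for pred in predictions.values() if normalize(pred) not in gold)
    return wrong / len(predictions)


# Hypothetical predictions from four QA models for a single question.
preds = {
    "model_a": "Paris",
    "model_b": "paris ",
    "model_c": "Lyon",
    "model_d": "France",
}
score = difficulty(["Paris"], preds)
print(score)  # 0.5
```

Exact match is the strictest choice here; a token-overlap score such as F1 could be swapped in if partial answers should count as partly correct.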

What problem does this paper attempt to address?

The paper addresses two related problems: the lack of machine reading comprehension (MRC) datasets for cultural heritage, and the high cost of creating such datasets manually. Specifically:

1. **Lack of MRC datasets for cultural heritage**: Existing MRC datasets mostly cover general domains and lack the specificity and complexity of cultural heritage information. As a result, they cannot fully evaluate or improve search and dialogue systems in the cultural heritage field.
2. **Cost of manual dataset creation**: Manually creating high-quality MRC datasets is very expensive, especially for most cultural heritage institutions. For example, writing the questions for the SQuAD dataset alone cost approximately $12,000, and the actual total cost may be even higher.

To address these problems, the paper proposes using reinforcement learning with synthetic preference data to generate domain-specific MRC datasets of increased difficulty. This approach both reduces the cost of dataset creation and raises the difficulty of the generated questions, which allows better evaluation and improvement of technical systems in the cultural heritage field.

### Main Contributions

1. **Methodology**: A method that uses PPO (Proximal Policy Optimization) and synthetic data to increase the difficulty of automatically generated questions.
2. **Empirical Evidence**: Evidence of the method's effectiveness, including human evaluation results.
3. **Error Analysis**: An in-depth error analysis and a study of emergent phenomena observed in the method.
4. **Open-source Code**: An open-source codebase and three llama-2-chat adapters for reproducing and adapting the research.

Through these contributions, the paper aims to give practitioners in the cultural heritage field an efficient and cost-effective way to generate challenging evaluation datasets.
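
The summary does not say how the synthetic preference data is constructed. As a hedged illustration only, one plausible scheme pairs questions by a per-question difficulty score in [0, 1] and marks the harder question as "chosen" and the easier one as "rejected". The function name `preference_pairs`, the `margin` parameter, and the example scores are all assumptions made for this sketch.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def preference_pairs(
    scored: Dict[str, float], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs in which the chosen question is harder.

    Pairs whose difficulty gap is not larger than `margin` are skipped,
    because the preference between them is too weak to be a useful signal.
    """
    pairs = []
    for (q1, d1), (q2, d2) in combinations(scored.items(), 2):
        if abs(d1 - d2) <= margin:
            continue
        pairs.append((q1, q2) if d1 > d2 else (q2, q1))
    return pairs


# Hypothetical difficulty scores (share of QA models that answered wrongly).
scored = {"Q easy": 0.0, "Q medium": 0.5, "Q hard": 1.0}
pairs = preference_pairs(scored, margin=0.25)
for chosen, rejected in pairs:
    print(f"prefer {chosen!r} over {rejected!r}")
```

Pairs of this shape are the usual input for training a reward model, which a PPO loop can then use to score generated questions. Whether the paper builds its pairs this way is not stated in the summary.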

Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference

Improving Socratic Question Generation using Data Augmentation and Preference Optimization

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Improving Reading Comprehension Question Generation with Data Augmentation and Overgenerate-and-rank

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Efficient In-Domain Question Answering for Resource-Constrained Environments

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation

Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems

Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation

RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference

Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study

Towards Comprehensive Preference Data Collection for Reward Modeling

RetrievalQA: Assessing Adaptive Retrieval-Augmented Generation for Short-form Open-Domain Question Answering

Aligning LLMs through Multi-perspective User Preference Ranking-based Feedback for Programming Question Answering

Diversify Question Generation with Retrieval-Augmented Style Transfer

Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework