Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian
Abstract: We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user-turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently out-perform those fine-tuned on existing human generated training data across four publicly available multi-turn document grounded benchmark test sets.
What problem does this paper attempt to address?
This paper addresses how to generate high-quality multi-turn dialogue data grounded in multiple documents, in order to improve document-grounded dialogue systems. Specifically, the authors aim to generate diverse, coherent synthetic dialogues with correct answers by simulating how Retrieval-Augmented Generation (RAG) is deployed in the real world. Models fine-tuned on this synthetic data outperform models fine-tuned on existing human-generated data on four public benchmark test sets.
### Main Challenges
1. **Diversity**: Ensure that the generated questions and dialogues are sufficiently diverse.
2. **Coherence**: Ensure that the dialogue is natural and fluent, and that it reads as a connected conversation rather than a collection of unrelated question-answer pairs.
3. **Faithfulness**: Ensure that the model's answers are faithful to the content of the retrieved documents, rather than relying on knowledge stored in the model's parameters.
### Solutions
To address these challenges, the authors propose the following techniques:
1. **Dialogue Flow Control**: Control the dialogue flow through taxonomy-driven user query generation with Chain-of-Thought (CoT) prompting.
2. **Multi-Document Support**: Mimic the way retrievers are used in the real world by updating the grounding documents after every user turn.
3. **LLM-as-a-Judge**: Use a large language model as a judge to filter out queries with incorrect answers.
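The interaction between the second and third ideas can be sketched as a loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the word-overlap `retrieve` function stands in for a real retriever, the callables `next_query`, `answer`, and `judge` are hypothetical placeholders for LLM calls, and dropping a rejected turn while continuing the dialogue is an assumed filtering policy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    query: str
    answer: str
    passages: List[str]  # grounding passages used for this turn


def retrieve(context: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the context."""
    words = set(context.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate_dialog(
    corpus: List[str],
    first_query: str,
    next_query: Callable[[List[Turn], List[str]], str],  # hypothetical LLM call
    answer: Callable[[str, List[str]], str],              # hypothetical LLM call
    judge: Callable[[str, str, List[str]], bool],         # hypothetical LLM judge
    max_turns: int = 3,
) -> List[Turn]:
    turns: List[Turn] = []
    query = first_query
    for _ in range(max_turns):
        # Re-retrieve after every user turn so the grounding set can change.
        context = " ".join([t.query for t in turns] + [query])
        passages = retrieve(context, corpus)
        reply = answer(query, passages)
        # Keep the turn only if the judge accepts the answer (assumed policy).
        if judge(query, reply, passages):
            turns.append(Turn(query, reply, passages))
        query = next_query(turns, passages)
    return turns
```

Passing the three LLM steps in as callables keeps the sketch runnable with simple stubs; in practice each would wrap a prompt to a generator or judge model.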
### Technical Details
- **Question Taxonomies**: Two question taxonomies are designed, one for the first turn of the dialogue (ST-QT) and one for subsequent turns (MT-QT). They cover types such as direct questions, comparison questions, aggregation questions, and unanswerable questions.
- **CoT Prompts**: CoT prompts are used to generate queries that conform to the predefined question types and to keep the generated answers consistent with the documents.
- **Dialogue Generation Pipeline**: The pipeline has two modes, single-document and multi-document. The former generates dialogues grounded in a single document, while the latter uses a retriever to dynamically select relevant passages.
- **LLM-as-a-Judge**: An LLM evaluates each generated pair of dialogue context and answer to check that the answer is correct.
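As a rough illustration of taxonomy-driven query generation, the sketch below picks a question type from the taxonomy that matches the turn position and embeds it in a CoT-style prompt. Everything here is an assumption made for illustration: the type lists reuse only the four types named above (the actual ST-QT and MT-QT taxonomies likely differ from each other), and the prompt wording and function names are hypothetical rather than taken from the paper.

```python
import random
from typing import List, Tuple

# Placeholder taxonomies (assumption): only the types named in the summary.
QUESTION_TYPES = {
    "ST-QT": ["direct", "comparison", "aggregation", "unanswerable"],
    "MT-QT": ["direct", "comparison", "aggregation", "unanswerable"],
}


def pick_question_type(turn_index: int, rng: random.Random) -> Tuple[str, str]:
    """Use ST-QT for the first turn and MT-QT for every later turn."""
    taxonomy = "ST-QT" if turn_index == 0 else "MT-QT"
    return taxonomy, rng.choice(QUESTION_TYPES[taxonomy])


def build_cot_prompt(
    passages: List[str],
    history: List[str],
    turn_index: int,
    rng: random.Random,
) -> str:
    """Assemble a hypothetical CoT prompt for one user query."""
    taxonomy, qtype = pick_question_type(turn_index, rng)
    docs = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    past = "\n".join(history) if history else "(no previous turns)"
    return (
        f"Documents:\n{docs}\n\n"
        f"Dialogue so far:\n{past}\n\n"
        f"Question type ({taxonomy}): {qtype}\n"
        "Think step by step:\n"
        "1. Identify which documents are relevant to this question type.\n"
        "2. Write a user query of that type that follows from the dialogue.\n"
        "3. Write an answer supported only by the documents.\n"
    )
```

Sampling the question type outside the model, instead of asking the model to choose one, is what lets the taxonomy rather than the LLM determine the mix of query types.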
### Experimental Results
In experiments with two instruction-tuned models (MERLINITE-7B and LLAMA-2-13B-CHAT), the authors show that models fine-tuned on the synthetic data perform better than those fine-tuned on existing human-generated data on four public multi-turn dialogue benchmarks (CoQA, MultiDoc2Dial, QuAC, and OR-QuAC). The gains are most pronounced on multi-document grounded tasks such as OR-QuAC.
### Summary
The main contributions of this paper are:
1. Propose the first multi-document grounded multi-turn dialogue generation pipeline that simulates real-world RAG deployment.
2. Ensure data diversity by driving query generation with question taxonomies instead of relying solely on the language model to choose queries.
3. Verify the quality of the generated dialogues through human evaluation.
4. Demonstrate the effectiveness of the synthetic data on answerable queries.
5. Commit to releasing all code and synthetic data.
Together, these methods address the challenges of generating high-quality multi-turn dialogue data grounded in multiple documents and offer a new approach to improving document-grounded dialogue systems.