Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering

Mingxu Tao, Dongyan Zhao, Yansong Feng
2024-09-28
Abstract: Open-ended question answering requires models to find appropriate evidence to form well-reasoned, comprehensive and helpful answers. In practical applications, models also need to engage in extended discussions on potential scenarios closely relevant to the question. With augmentation of retrieval module, open-source Large Language Models (LLMs) can produce coherent answers often with different focuses, but are still sub-optimal in terms of reliable evidence selection and in-depth question analysis. In this paper, we propose a novel Chain-of-Discussion framework to leverage the synergy among multiple open-source LLMs aiming to provide **more correct** and **more comprehensive** answers for open-ended QA, although they are not strong enough individually. Our experiments show that discussions among multiple LLMs play a vital role in enhancing the quality of answers. We release our data and code at https://github.com/kobayashikanna01/Chain-of-Discussion.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses weak evidence selection and shallow question analysis in answers to complex open-ended questions. Existing large language models (LLMs) can generate coherent answers to such questions, but they remain deficient in selecting reliable evidence and in analyzing the question in depth. These problems show up in two ways:
1. **Imperfect retrieval models**: Retrieval models may introduce noisy evidence, and the LLM cannot always filter all of it out, which harms the completeness and accuracy of the answer. For example, in legal consultation, the retriever may return legal provisions on guardianship qualifications instead of those on economic support obligations, because the two are semantically similar.
2. **Comprehensiveness and consistency of answers**: The model is expected not only to give a correct answer, but also to explain it consistently and to offer useful advice for situations the user may face now or in the future. Even humans find this difficult, especially when it requires locating appropriate evidence. For LLMs without specific training or fine-tuning, it is harder still.
To address these challenges, the paper proposes a new framework named "Chain-of-Discussion", which improves the accuracy and comprehensiveness of answers through interactive discussion among multiple open-source LLMs. Specifically, the framework encourages multiple LLMs to summarize, criticize, and correct each other's outputs in order to reach a more evidence-based and practical answer.
The main contributions of the paper include:
1. A high-quality complex evidence-based question-answering (CEBQA) dataset, containing 200 carefully annotated legal consultation questions in the field of marriage and family affairs.
2. A new Chain-of-Discussion framework, based on summarize-criticize-revise steps, which uses the synergy among multiple open-source LLMs to generate more accurate and useful answers.
3. GPT-4-based and evidence-centered evaluations showing that the framework helps small LLMs benefit from each other and improves overall answer quality, especially in correctness and comprehensiveness.
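The summarize-criticize-revise loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` type, the function name `chain_of_discussion`, the prompt wording, and the choice of a single lead model that drafts and revises while the remaining models critique are all hypothetical. Each model is represented as a plain callable from a prompt string to a response string.

```python
from typing import Callable, List

# Hypothetical interface: a model is any callable mapping a prompt to a reply.
Model = Callable[[str], str]


def chain_of_discussion(question: str, evidence: List[str], models: List[Model]) -> str:
    """Illustrative summarize -> criticize -> revise loop (not the paper's code)."""
    if not models:
        raise ValueError("at least one model is required")

    # Number the retrieved evidence so models can refer to individual items.
    context = "\n".join(f"[{i}] {e}" for i, e in enumerate(evidence, start=1))
    base = f"Question: {question}\nEvidence:\n{context}"

    # Assumption: the first model leads, the others act as reviewers.
    lead, peers = models[0], models[1:]

    # 1. Summarize: the lead model drafts an answer from the evidence.
    draft = lead(f"SUMMARIZE\n{base}")

    # 2. Criticize: every other model reviews the draft against the evidence.
    critiques = [peer(f"CRITICIZE\n{base}\nDraft: {draft}") for peer in peers]
    if not critiques:
        return draft  # nobody to discuss with, so the draft is final

    # 3. Revise: the lead model rewrites the draft in light of the critiques.
    feedback = "\n".join(f"- {c}" for c in critiques)
    return lead(f"REVISE\n{base}\nDraft: {draft}\nCritiques:\n{feedback}")
```

In practice each callable would wrap a call to an open-source LLM. Keeping the models behind a simple callable interface makes the discussion logic easy to test with stubs before any real model is attached.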