IslamicPCQA: A Dataset for Persian Multi-hop Complex Question Answering in Islamic Text Resources

Arash Ghafouri, Hasan Naderi, Mohammad Aghajani asl, Mahdi Firouzmandi
2023-04-23
Abstract: Nowadays, one of the main challenges for Question Answering Systems is to answer complex questions using various sources of information. Multi-hop questions are a type of complex questions that require multi-step reasoning to answer. In this article, the IslamicPCQA dataset is introduced. This is the first Persian dataset for answering complex questions based on non-structured information sources and consists of 12,282 question-answer pairs extracted from 9 Islamic encyclopedias. This dataset has been created inspired by the HotpotQA English dataset approach, which was customized to suit the complexities of the Persian language. Answering questions in this dataset requires more than one paragraph and reasoning. The questions are not limited to any prior knowledge base or ontology, and to provide robust reasoning ability, the dataset also includes supporting facts and key sentences. The prepared dataset covers a wide range of Islamic topics and aims to facilitate answering complex Persian questions within this subject matter.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve is that current question-answering systems perform poorly on complex questions, especially those that require multi-hop reasoning, and particularly in low-resource languages such as Persian. To address this challenge, the authors created a dataset named IslamicPCQA, which is the first dataset for answering complex Persian questions based on unstructured information sources. Specifically, the paper addresses the following aspects:

1. **Lack of a complex-question dataset suitable for Persian**:
   - Most current high-quality question-answering datasets are in English, and low-resource languages such as Persian lack comparable complex-question datasets.
   - This has led to poor performance of Persian question-answering systems, especially on questions that require multi-step reasoning.
2. **Constructing a dataset of multi-step reasoning questions**:
   - To improve the performance of Persian question-answering systems on complex, multi-step reasoning questions, the authors created a dataset containing 12,282 question-answer pairs.
   - The questions are drawn from 9 Islamic encyclopedias and cover a wide range of Islamic topics. Each question requires more than one paragraph and multiple reasoning steps to answer.
3. **Ensuring the quality and diversity of the dataset**:
   - The dataset includes not only questions and answers but also supporting facts and key sentences, so that the reasoning behind each answer can be traced and checked.
   - The question types are diverse, including comparison questions and bridge questions, all of which require multi-step reasoning to answer.
4. **Adapting to the complexity of Persian**:
   - The design of the dataset takes into account the characteristics of Persian, such as grammatical structure and vocabulary usage, so that the generated questions and answers reflect actual Persian usage.
By solving these problems, the authors hope to promote the development of Persian question-answering systems, enabling them to better handle complex multi-step reasoning questions, thereby enhancing user experience and system performance.
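
To make the role of supporting facts concrete, here is a minimal Python sketch of what a HotpotQA-style record could look like and how its supporting facts can be resolved back to sentences. The field names (`question`, `answer`, `type`, `context`, `supporting_facts`) follow the layout commonly used for HotpotQA. Whether IslamicPCQA uses exactly this schema is an assumption, and the record contents, `resolve_supporting_facts`, and `is_multi_hop` are hypothetical placeholders written for illustration only.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical record in a HotpotQA-style layout. The field names and the
# placeholder text are assumptions, not data taken from IslamicPCQA.
example: Dict[str, Any] = {
    "question": "Placeholder bridge question linking Article A to Article B?",
    "answer": "placeholder answer",
    "type": "bridge",  # the other common type is "comparison"
    "context": [
        ["Article A", ["A sentence 0.", "A sentence 1."]],
        ["Article B", ["B sentence 0.", "B sentence 1."]],
        ["Article C", ["C sentence 0."]],  # distractor paragraph
    ],
    # Each supporting fact is a [paragraph title, sentence index] pair.
    "supporting_facts": [["Article A", 1], ["Article B", 0]],
}


def resolve_supporting_facts(record: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Map each [title, sentence index] pair to the sentence it points at."""
    paragraphs = {title: sentences for title, sentences in record["context"]}
    resolved = []
    for title, idx in record["supporting_facts"]:
        sentences = paragraphs.get(title)
        if sentences is None or not 0 <= idx < len(sentences):
            raise ValueError(f"supporting fact ({title!r}, {idx}) not in context")
        resolved.append((title, sentences[idx]))
    return resolved


def is_multi_hop(record: Dict[str, Any]) -> bool:
    """True when the supporting facts span at least two distinct paragraphs."""
    return len({title for title, _ in record["supporting_facts"]}) >= 2


if __name__ == "__main__":
    for title, sentence in resolve_supporting_facts(example):
        print(f"{title}: {sentence}")
    print("multi-hop:", is_multi_hop(example))
```

A check such as `is_multi_hop` reflects the property the abstract describes: answering a question should require evidence from more than one paragraph, so a record whose supporting facts all come from a single paragraph would not qualify.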