Abstract: The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot keep a conversation with a human flowing. Existing question generation models are ineffective at generating large numbers of high-quality question-answer pairs from unstructured text because, given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims to automatically generate high-quality and diverse question-answer pairs from unlabeled text corpora at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples multiple types of assistive information from the text to guide question generation; ii) neural question generators, which use the extracted assistive information to generate diverse and controllable questions; and iii) a neural quality controller, which removes low-quality generated data based on textual entailment. We compare our question generation models with existing approaches and use volunteer human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in generation quality while remaining scalable. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.
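
The three-stage design described in the abstract (extract assistive information, generate questions, filter by quality) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Guidance`, `QAPair`, `run_pipeline`, and the `toy_*` functions are hypothetical, the field names are inferred only from the system's name (Answer-Clue-Style), and simple string rules stand in for the neural generators and the entailment-based controller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Guidance:
    """Assistive information sampled from a passage (hypothetical fields)."""
    answer: str
    clue: str
    style: str


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


def run_pipeline(
    passages: Iterable[str],
    extract: Callable[[str], List[Guidance]],
    generate: Callable[[str, Guidance], str],
    keep: Callable[[str, QAPair], bool],
) -> List[QAPair]:
    """Extract guidance, generate one question per guidance, then filter."""
    kept: List[QAPair] = []
    for passage in passages:
        for guidance in extract(passage):
            pair = QAPair(generate(passage, guidance), guidance.answer)
            if keep(passage, pair):
                kept.append(pair)
    return kept


# Toy stand-ins so the sketch runs; the paper describes neural models here.
def toy_extract(passage: str) -> List[Guidance]:
    words = passage.rstrip(".").split()
    # Assumption: treat the first and last word as candidate answers and the
    # remaining words as the clue.
    return [
        Guidance(answer=words[0], clue=" ".join(words[1:]), style="who"),
        Guidance(answer=words[-1], clue=" ".join(words[:-1]), style="what"),
    ]


def toy_generate(passage: str, g: Guidance) -> str:
    if g.style == "who":
        return f"Who {g.clue}?"
    return f"{g.clue} {g.style}?"


def toy_keep(passage: str, pair: QAPair) -> bool:
    # Placeholder for an entailment check: the answer must occur in the
    # passage and must not leak into the question.
    return pair.answer in passage and pair.answer not in pair.question


pairs = run_pipeline(
    ["Curie discovered polonium."], toy_extract, toy_generate, toy_keep
)
```

Because the extractor returns more than one `Guidance` per passage, a single sentence yields several question-answer pairs, which mirrors the one-to-many mapping the abstract mentions. Passing the three stages in as callables is a choice made for this sketch, so that each toy function could be replaced by a real model without changing the loop.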

Natural Questions: A Benchmark for Question Answering Research

Would You Ask it that Way? Measuring and Improving Question Naturalness for Knowledge Graph Question Answering

SelQA: A New Benchmark for Selection-based Question Answering

Towards Automatic Generation of Questions from Long Answers

What is in the KGQA Benchmark Datasets? Survey on Challenges in Datasets for Question Answering on Knowledge Graphs

QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs

Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents

Quda: Natural Language Queries for Visual Data Analytics

Modern Question Answering Datasets and Benchmarks: A Survey

QuALITY: Question Answering with Long Input Texts, Yes!

SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices

Identifying Well-formed Natural Language Questions

KBQA: Learning Question Answering over QA Corpora and Knowledge Bases

Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries

Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

Question-Answering of UGC

CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems

NewsQs: Multi-Source Question Generation for the Inquiring Mind

Generating Biomedical Question Answering Corpora from Q&A forums