Abstract: Teachers and students increasingly rely on online learning resources to supplement those provided in school. This increase in the breadth and depth of available resources benefits students, but only if they are able to find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets consist of English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or to some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. This dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and the explanations verified both for their appropriateness and their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French-speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on multimodal comprehension. The case study we present as a baseline shows that an approach relying on recent techniques provides an acceptable level of performance, but more work is necessary before it can be reliably used and trusted in a production setting.

FQuAD: French Question Answering Dataset

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension

FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain

ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset

Alloprof: a new French question-answer education dataset and its use in an information retrieval case study

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

QuALITY: Question Answering with Long Input Texts, Yes!

BanglaQuAD: A Bengali Open-domain Question Answering Dataset

KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension

A Vietnamese Dataset for Evaluating Machine Reading Comprehension

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

FAMMA: A Benchmark for Financial Domain Multilingual Multimodal Question Answering

GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning

NorQuAD: Norwegian Question Answering Dataset

Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering

QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers

KazQAD: Kazakh Open-Domain Question Answering Dataset

FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models

XQA: A Cross-lingual Open-domain Question Answering Dataset

SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation