ELQA: A Corpus of Metalinguistic Questions and Answers about English

Shabnam Behzad, Keisuke Sakaguchi, Nathan Schneider, Amir Zeldes
2023-07-04
Abstract: We present ELQA, a corpus of questions and answers in and about the English language. Collected from two online forums, the >70k questions (from English learners and others) cover wide-ranging topics including grammar, meaning, fluency, and etymology. The answers include descriptions of general properties of English vocabulary and grammar as well as explanations about specific (correct and incorrect) usage examples. Unlike most NLP datasets, this corpus is metalinguistic -- it consists of language about language. As such, it can facilitate investigations of the metalinguistic capabilities of NLU models, as well as educational applications in the language learning domain. To study this, we define a free-form question answering task on our dataset and conduct evaluations on multiple LLMs (Large Language Models) to analyze their capacity to generate metalinguistic answers.
Computation and Language
What problem does this paper attempt to address?
This paper sets out to build a corpus of metalinguistic questions and answers about the English language, and to use that corpus to evaluate how well current state-of-the-art natural language processing (NLP) models perform on free-form question answering about English. Specifically, the goals of the paper are:

1. **Release the first publicly available English metalinguistic question-answering dataset**: The dataset contains more than 70,000 questions and their answers, collected from two Stack Exchange forums (English Language & Usage and English Language Learners), and covers topics such as grammar, meaning, fluency, and etymology.
2. **Classify and analyze the questions in the corpus**: The paper groups the questions into types, including fluency, form-to-meaning interpretation, meaning-to-form encoding, grammatical analysis, and other questions, and describes and analyzes each type in detail.
3. **Study how well large language models (LLMs) answer these metalinguistic questions**: By defining a free-form question-answering task, the paper evaluates several LLMs (such as T5 and GPT-3) on generating metalinguistic answers. It finds that although most models score relatively high on the correctness of language structures, they score significantly lower than human-written answers on the effectiveness of the answers, which indicates that metalinguistic question answering remains a challenge for LLMs.

Through these goals, the paper aims to advance research on the metalinguistic abilities of natural language understanding (NLU) models and to explore applications of such models in language-learning education.