M2QA: Multi-domain Multilingual Question Answering

Leon Engländer, Hannah Sterz, Clifton Poth, Jonas Pfeiffer, Ilia Kuznetsov, Iryna Gurevych
2024-07-01
Abstract: Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary. We make M2QA publicly available at https://github.com/UKPLab/m2qa.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the limited ability of natural language processing (NLP) systems to generalize across languages and domains at the same time. Although Transformer models and large language models (LLMs) have made remarkable progress, generalizing to new domains and new languages remains an open problem. The paper points out that existing multi-domain or multilingual benchmarks either cover only one language, rely on machine-generated text, or are limited to specific application domains, which makes it difficult to evaluate joint language and domain transfer methods objectively.

To fill this gap, the authors introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances covering three languages (German, Turkish, and Chinese) and three domains (product reviews, news, and creative writing). With M2QA, the authors examine how fine-tuned models and state-of-the-art LLMs perform across languages and domains, and study modular methods for adapting to new languages and domains.

The main contributions of the paper include:
1. Creating a multi-domain multilingual extractive question answering benchmark covering three languages and three domains, with a total of 13,500 answerable and unanswerable question-answer instances.
2. Evaluating baseline and transfer performance with a wide range of models and transfer techniques, including fully fine-tuned models, modular transfer learning, and LLMs.
3. Finding significant differences in transfer performance across language-domain combinations.
4. Proposing an adapted SQuAD 2.0 evaluation metric that is better suited to multilingual extractive question answering.
5. Showing that modern LLMs perform far worse on their target language-domain pairs than on their source language-domain pairs, which emphasizes the need for further research on methods that transfer language-specific and domain-specific information simultaneously.
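To illustrate why a SQuAD 2.0-style metric needs adapting for multilingual data, the sketch below computes exact match and token-level F1 with a language-dependent tokenizer. It is not the metric from the paper: the function names (`tokenize`, `exact_match`, `f1`), the `lang` codes, and the choice of character-level tokens for Chinese are all assumptions made for illustration. The handling of unanswerable questions follows the usual SQuAD 2.0 convention that an empty gold answer is matched only by an empty prediction.

```python
import string
from collections import Counter


def tokenize(text: str, lang: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split into tokens.

    Assumption (not from the paper): Chinese ("zh") is split into
    single characters because it has no whitespace word boundaries;
    every other language is split on whitespace.
    Note: string.punctuation covers ASCII punctuation only.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    if lang == "zh":
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def exact_match(prediction: str, gold: str, lang: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokenize(prediction, lang) == tokenize(gold, lang))


def f1(prediction: str, gold: str, lang: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred_tokens = tokenize(prediction, lang)
    gold_tokens = tokenize(gold, lang)
    # Unanswerable question: an empty gold answer is only matched
    # by an empty prediction (SQuAD 2.0 convention).
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With whitespace tokenization, a Chinese prediction that contains the gold answer plus one extra character is treated as a single mismatching token and scores zero, whereas character-level tokens give it partial credit. This is the kind of mismatch a multilingual adaptation of the metric has to address.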