GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems

Bosheng Ding, Junjie Hu, Lidong Bing, Sharifah Mahani Aljunied, Shafiq Joty, Luo Si, Chunyan Miao
DOI: https://doi.org/10.48550/arXiv.2110.07679
2022-04-01
Abstract: Much recent progress in task-oriented dialogue (ToD) systems has been driven by available annotation data across multiple domains for training. Over the last few years, there has been a move towards data curation for multilingual ToD systems that are applicable to serve people speaking different languages. However, existing multilingual ToD datasets either have a limited coverage of languages due to the high cost of data curation, or ignore the fact that dialogue entities barely exist in countries speaking these languages. To tackle these limitations, we introduce a novel data curation method that generates GlobalWoZ -- a large-scale multilingual ToD dataset globalized from an English ToD dataset for three unexplored use cases. Our method is based on translating dialogue templates and filling them with local entities in the target-language countries. We release our dataset as well as a set of strong baselines to encourage research on learning multilingual ToD systems for real use cases.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two main limitations in how datasets for multilingual task-oriented dialogue (ToD) systems are built:

1. **Limited language coverage**: Because data collection is expensive, existing multilingual ToD datasets usually cover only a few languages.
2. **Ignoring local entities**: When translating English ToD datasets, existing methods simply translate the English named entities (such as place names and restaurant names) into the target language, ignoring the fact that these entities hardly exist in the countries where the target language is spoken.

To address these problems, the paper proposes a new dataset construction method and uses it to generate GlobalWoZ, a large-scale multilingual ToD dataset. The method translates dialogue templates and fills them with local entities from the target-language countries. This supports three previously unexplored multilingual ToD use cases:

- **F&F**: Foreign-language speakers use the ToD system in a foreign-language-speaking country.
- **F&E**: Foreign-language speakers use the ToD system in an English-speaking country.
- **E&F**: English speakers use the ToD system in a foreign-language-speaking country.

In addition, the paper examines how prevalent code-switching is in cross-language and cross-country task-oriented dialogues, and shows experimentally that current multilingual models perform poorly in zero-shot cross-lingual transfer. To improve performance, the paper proposes a series of data augmentation methods for training stronger baseline models.
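
The template-and-fill idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `[slot]` placeholder format, the function names, the lookup table standing in for machine translation, and all example sentences and entity names are assumptions made up for this sketch.

```python
import random
import re

# Hypothetical placeholder format "[slot_name]"; the paper's actual
# template format may differ.
SLOT_PATTERN = re.compile(r"\[([a-z_]+)\]")


def make_template(utterance, entities):
    """Replace each entity value in an utterance with a [slot] placeholder."""
    template = utterance
    # Replace longer values first so a short value cannot split a longer one.
    for slot, value in sorted(entities.items(), key=lambda kv: -len(kv[1])):
        template = template.replace(value, f"[{slot}]")
    return template


def translate_template(template, translations):
    """Stand-in for machine translation: a plain dictionary lookup.

    Assumes the placeholders survive translation unchanged.
    """
    return translations[template]


def fill_template(template, entity_pool, rng):
    """Fill every [slot] with an entity sampled from the local entity pool.

    Returns the filled utterance and the slot labels that go with it.
    """
    labels = {}

    def pick(match):
        slot = match.group(1)
        # Reuse the same value if a slot appears more than once.
        if slot not in labels:
            labels[slot] = rng.choice(entity_pool[slot])
        return labels[slot]

    return SLOT_PATTERN.sub(pick, template), labels


# Toy data, invented for illustration only.
english_utterance = "I need a table at Pizza Hut in the centre."
english_entities = {"restaurant_name": "Pizza Hut", "area": "centre"}
toy_translations = {
    "I need a table at [restaurant_name] in the [area].":
        "Necesito una mesa en [restaurant_name], en la zona [area].",
}
local_entities = {"restaurant_name": ["La Tasca Azul"], "area": ["centro"]}

rng = random.Random(0)
en_template = make_template(english_utterance, english_entities)
es_template = translate_template(en_template, toy_translations)

# F&F-style example: translated template plus local entities.
ff_text, ff_labels = fill_template(es_template, local_entities, rng)
# E&F-style example: English template plus local entities.
ef_text, ef_labels = fill_template(en_template, local_entities, rng)

print(ff_text)
print(ef_text)
```

Under this reading, the three use cases differ only in which template and which entity pool are combined: a translated template with local entities (F&F), a translated template with the original English entities (F&E), or the English template with local entities (E&F). Because the slot labels are produced at fill time, the dialogue-state annotations stay aligned with the new entity values.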