A Dataset for Building Code-Mixed Goal Oriented Conversation Systems

Suman Banerjee, Nikita Moghe, Siddhartha Arora, Mitesh M. Khapra

DOI: https://doi.org/10.48550/arXiv.1806.05997

2018-06-15

Abstract: There is an increasing demand for goal-oriented conversation systems which can assist users in various day-to-day activities such as booking tickets, restaurant reservations, shopping, etc. Most of the existing datasets for building such conversation systems focus on monolingual conversations and there is hardly any work on multilingual and/or code-mixed conversations. Such datasets and systems thus do not cater to the multilingual regions of the world, such as India, where it is very common for people to speak more than one language and seamlessly switch between them, resulting in code-mixed conversations. For example, a Hindi-speaking user looking to book a restaurant would typically ask, "Kya tum is restaurant mein ek table book karne mein meri help karoge?" ("Can you help me in booking a table at this restaurant?"). To facilitate the development of such code-mixed conversation models, we build a goal-oriented dialog dataset containing code-mixed conversations. Specifically, we take the text from the DSTC2 restaurant reservation dataset and create code-mixed versions of it in Hindi-English, Bengali-English, Gujarati-English and Tamil-English. We also establish initial baselines on this dataset using existing state-of-the-art models. This dataset along with our baseline implementations is made publicly available for research purposes.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the fact that existing datasets and systems for goal-oriented dialogue focus mainly on monolingual conversations and largely overlook multilingual or code-mixed ones. As a result, they do not serve multilingual regions such as India, where speakers routinely switch between languages within a conversation, for example mixing Hindi and English or Bengali and English. The paper therefore builds a goal-oriented dialogue dataset of code-mixed conversations to support the development of systems that can handle them. Specifically, the authors take the text of the DSTC2 restaurant reservation dataset and create code-mixed versions of it in Hindi-English, Bengali-English, Gujarati-English and Tamil-English. They also establish initial baselines on the dataset using existing state-of-the-art models, and release both the dataset and the baseline implementations publicly for research purposes. Given how little prior work exists on code-mixed conversation, the dataset is intended to encourage further research in this area.
