Abstract: Dialogue systems have been widely applied in many scenarios and are now more powerful and ubiquitous than ever before. With large neural models and massive available data, current dialogue systems have access to more knowledge than any person could acquire in a lifetime. However, they still do not perform at a human level. One major gap between conversational agents and humans lies in awareness of social norms. The development of socially-aware dialogue systems is impeded by a lack of resources. In this paper, we present the first socially-aware dialogue corpus, SocialDial, based on Chinese social culture. SocialDial consists of two parts: 1,563 multi-turn dialogues between two human speakers with fine-grained labels, and 4,870 synthetic conversations generated by ChatGPT. The human corpus covers five categories of social norms, which have 14 sub-categories in total. Specifically, it contains social factor annotations including social relation, context, social distance, and social norms. Because collecting sufficient socially-aware dialogues is costly, we harness ChatGPT and devise an ontology-based synthetic data generation framework that can generate synthetic data at scale. To ensure the quality of the synthetic dialogues, we design several quality-control mechanisms for data collection. Finally, we evaluate our dataset using several pre-trained models, such as BERT and RoBERTa. Comprehensive empirical results based on state-of-the-art neural models demonstrate that modeling social norms for dialogue systems is a promising research direction. To the best of our knowledge, SocialDial is the first socially-aware dialogue dataset that covers multiple social factors and has fine-grained labels.

Interview: A Large-Scale Open-Source Corpus of Media Dialog

OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

Audio Dialogues: Dialogues dataset for audio and music understanding

MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

XDailyDialog: A Multilingual Parallel Dialogue Corpus

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

Advancing an Interdisciplinary Science of Conversation: Insights from a Large Multimodal Corpus of Human Speech

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically Constructed from Live Streaming

J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

A Large-Scale Chinese Short-Text Conversation Dataset

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models

NewsInterview: a Dataset and a Playground to Evaluate LLMs' Ground Gap via Informational Interviews

SocialDial: A Benchmark for Socially-Aware Dialogue Systems

A Dataset for Sentence Retrieval for Open-Ended Dialogues

MedDialog: A Large-scale Medical Dialogue Dataset

CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems