Abstract: In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv, which allows the participants to chat about anything they want as long as any element from the topic is mentioned and the topic shift is smooth. Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1. These conversations contain in-depth discussions on related topics or wide-ranging, natural transitions between multiple topics. We believe either way is normal for human conversation. To facilitate the research on this corpus, we provide results of several benchmark models. Comparative results show that for this dataset, our current models are not able to provide significant improvement by introducing background knowledge/topic. Therefore, the proposed dataset should be a good benchmark for further research to evaluate the validity and naturalness of multi-turn conversation systems. Our dataset is available at https://ailab.tencent.com/ailab/nlp/dialogue/#datasets.

What problem does this paper attempt to address?

This paper targets several key challenges in current open-domain dialogue systems:

1. **Lack of naturalness**: Existing open-domain dialogue systems can generate general responses, but these responses often lack meaningful information and do not sustain in-depth or natural conversation the way humans do. The main reason is that current models struggle to keep a multi-turn dialogue coherent and natural.
2. **Insufficient utilization of background knowledge**: Some studies try to improve dialogue quality by introducing background knowledge (such as knowledge graphs or documents), but how to integrate this knowledge effectively into dialogue generation remains an unsolved problem. Many existing datasets require participants to stay within a given topic and assume that they are already familiar with the provided documents or knowledge graphs, which is inconsistent with how people converse in real life.
3. **Lack of dialogue scenarios**: Real-life conversations usually take place in specific scenarios rather than simply revolving around a topic. Existing dialogue datasets often overlook this, so the resulting dialogues lack a sense of authenticity.

To address these problems, the paper proposes NaturalConv, a new Chinese multi-turn topic-driven dialogue dataset. The dataset has the following characteristics:

- **Naturalness**: Participants may freely expand on topics as long as some information from the news article is mentioned and the topic transition is natural. Small talk and greetings are also allowed, which brings the dialogue closer to real communication.
- **Scenario setting**: Participants are required to assume a dialogue scenario, such as a conversation between two students before class. This setting makes the dialogue more vivid and specific.
- **Rich content**: The dataset contains 19,900 dialogues and 400,000 utterances from six domains, with an average of 20.1 turns per dialogue. These dialogues cover both in-depth discussions of related topics and natural transitions between multiple topics.

Through these designs, NaturalConv aims to provide a dataset that is closer to real human conversation, thereby promoting research on the naturalness and effectiveness of dialogue systems.
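The reported corpus figures can be sanity-checked with a little arithmetic: 19,900 dialogues at an average of 20.1 turns each gives roughly 400K utterances. The sketch below is only an illustration. The function names and the toy dialogues are hypothetical, and it assumes each dialogue is represented as a plain list of utterance strings, which is not necessarily the format of the released dataset.

```python
from statistics import mean

# Figures reported in the abstract.
NUM_CONVERSATIONS = 19_900
AVG_TURNS = 20.1


def estimated_utterances(num_conversations, avg_turns):
    """Estimate total utterances as conversations times average turns."""
    return round(num_conversations * avg_turns)


def average_turns(dialogues):
    """Mean number of utterances per dialogue.

    Assumes each dialogue is a list of utterance strings (hypothetical format).
    """
    return mean(len(dialogue) for dialogue in dialogues)


# Hypothetical toy sample, not taken from NaturalConv.
toy_dialogues = [
    ["Hi", "Hello", "Did you see the game?", "Yes, great match"],
    ["Morning", "Morning, ready for class?"],
]

print(estimated_utterances(NUM_CONVERSATIONS, AVG_TURNS))  # close to the reported 400K
print(average_turns(toy_dialogues))
```

The estimate lands within a few utterances of 400,000, so the three reported numbers are mutually consistent up to rounding.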

NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation

KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation

OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

MMConv: An Environment for Multimodal Conversational Search across Multiple Domains

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

Proactive Human-Machine Conversation with Explicit Conversation Goals

Fostering Natural Conversation in Large Language Models with NICO: a Natural Interactive COnversation dataset

The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

CNAMD Corpus: A Chinese Natural Audiovisual Multimodal Database of Conversations for Social Interactive Agents

Conversation Chronicles: Towards Diverse Temporal and Relational Dynamics in Multi-Session Conversations

CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

A Large-Scale Chinese Short-Text Conversation Dataset

A Dataset for Research on Short-Text Conversations

Exploring Effective Information Utilization in Multi-Turn Topic-Driven Conversations

An Empirical Study On Deep Neural Network Models For Chinese Dialogue Generation

ConvNTM: Conversational Neural Topic Model

DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset

Overview of the NLPCC 2018 Shared Task: Multi-turn Human-Computer Conversations

ConvSearch: A Open-Domain Conversational Search Behavior Dataset