NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation

Xiaoyang Wang, Chen Li, Jianqiao Zhao, Dong Yu
DOI: https://doi.org/10.1609/aaai.v35i16.17649
2024-11-07
Abstract: In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv, which allows the participants to chat anything they want as long as any element from the topic is mentioned and the topic shift is smooth. Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1. These conversations contain in-depth discussions on related topics or widely natural transition between multiple topics. We believe either way is normal for human conversation. To facilitate the research on this corpus, we provide results of several benchmark models. Comparative results show that for this dataset, our current models are not able to provide significant improvement by introducing background knowledge/topic. Therefore, the proposed dataset should be a good benchmark for further research to evaluate the validity and naturalness of multi-turn conversation systems. Our dataset is available at https://ailab.tencent.com/ailab/nlp/dialogue/#datasets.
Computation and Language
What problem does this paper attempt to address?
This paper addresses several key challenges in current open-domain dialogue systems:

1. **Lack of naturalness**: Existing open-domain dialogue systems can generate generic responses, but these responses often carry little meaningful information and fall short of the in-depth, natural conversations humans have. The main reason is that current models struggle to keep a multi-turn dialogue coherent and natural.
2. **Insufficient utilization of background knowledge**: Some studies try to improve dialogue quality by introducing background knowledge (such as knowledge graphs or documents), but how to integrate that knowledge effectively into dialogue generation remains an open problem. Many existing datasets restrict participants to a given topic and assume they are already familiar with the provided documents or knowledge graphs, which does not match how people converse in real life.
3. **Lack of dialogue scenarios**: Real-life conversations usually take place in a specific scenario rather than simply revolving around a topic. Existing dialogue datasets often overlook this, so the resulting dialogues lack a sense of authenticity.

To address these problems, the paper proposes NaturalConv, a new Chinese multi-turn topic-driven dialogue dataset with the following characteristics:

- **Naturalness**: Participants may freely expand on topics as long as some information from the news article is mentioned and the topic transitions are natural. Small talk and greetings are also allowed, which brings the dialogue closer to real communication.
- **Scenario setting**: Participants are asked to assume a dialogue scenario, such as a conversation between two students before class. This makes the dialogue more vivid and specific.
- **Rich content**: The dataset contains about 19,900 dialogues and about 400,000 utterances from six domains, with an average of 20.1 turns per dialogue. The dialogues cover both in-depth discussions of related topics and natural transitions between multiple topics.

Through these designs, NaturalConv aims to provide a dataset that is closer to real human conversation, thereby promoting research on the naturalness and effectiveness of dialogue systems.
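
As a quick sanity check on the reported corpus statistics, the average turn count can be recomputed from the two totals. The snippet below is only an illustrative sketch: it uses the rounded figures from the abstract (19.9K conversations, 400K utterances), not exact counts from the dataset, and the variable names are made up for this example.

```python
# Rounded figures as reported in the abstract; the exact counts
# in the released corpus may differ slightly.
num_conversations = 19_900
num_utterances = 400_000

# Average number of utterances (turns) per conversation.
avg_turns = num_utterances / num_conversations

print(f"average turns per conversation: {avg_turns:.1f}")
```

With these rounded inputs the result agrees with the reported average of 20.1 turns, so the three headline numbers are mutually consistent.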