Abstract: Pre-training on large-scale open-domain dialogue data can substantially improve the performance of dialogue models. However, a pre-trained dialogue model's ability to utilize long-range context is limited by the scarcity of long-turn dialogue sessions: most dialogues in existing pre-training corpora contain fewer than three turns. To alleviate this issue, we propose the Retrieve, Reorganize and Rescale framework (Re$^3$Dial), which automatically constructs billion-scale long-turn dialogues by reorganizing existing short-turn ones. Given a short-turn session, Re$^3$Dial first employs a session retriever to retrieve coherent consecutive sessions. To this end, we train the retriever through contrastive training to capture semantic and discourse relations within multi-turn dialogues. Next, Re$^3$Dial samples a session from the retrieved results following a diversity sampling strategy, which is designed to penalize repetitive or generic sessions. A longer session is then derived by concatenating the original session and the sampled session. By repeating this process, Re$^3$Dial can yield a coherent long-turn dialogue. Extensive experiments on multiple multi-turn dialogue benchmarks demonstrate that Re$^3$Dial significantly improves the dialogue model's ability to utilize long-range context and thus to generate more sensible and informative responses. Finally, we build a toolkit for efficiently rescaling conversations with Re$^3$Dial, which enables us to construct a corpus containing 1B Chinese dialogue sessions with 11.3 turns on average (5$\times$ longer than the original corpus). Our retriever model, code, and data are publicly available at \url{https://github.com/thu-coai/Re3Dial}.
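
The retrieve, sample, concatenate loop described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the bag-of-words `embed` function stands in for the contrastively trained session retriever, the `penalty ** use_count` weighting is only one hypothetical way to penalize frequently reused sessions, and using the most recently appended session as the next query is an assumption. All function and parameter names are illustrative.

```python
import math
import random
import re
from collections import Counter


def embed(session):
    """Toy bag-of-words vector; a stand-in for the trained session retriever."""
    return Counter(w for turn in session for w in re.findall(r"\w+", turn.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, exclude, top_k):
    """Return up to top_k (index, score) pairs, skipping excluded sessions."""
    q = embed(query)
    scored = [(i, cosine(q, embed(s))) for i, s in enumerate(corpus) if i not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def diversity_sample(candidates, use_counts, rng, penalty=0.5):
    """Sample one index; sessions used often before get a lower weight (hypothetical rule)."""
    indices = [i for i, _ in candidates]
    # The small constant keeps every weight positive even when a score is zero.
    weights = [score * penalty ** use_counts[i] + 1e-9 for i, score in candidates]
    return rng.choices(indices, weights=weights, k=1)[0]


def build_long_dialogue(seed_idx, corpus, use_counts, rng, steps=2, top_k=2):
    """Grow a long dialogue by repeatedly retrieving, sampling, and concatenating."""
    picked = [seed_idx]
    dialogue = list(corpus[seed_idx])
    query = corpus[seed_idx]
    for _ in range(steps):
        candidates = retrieve(query, corpus, exclude=set(picked), top_k=top_k)
        if not candidates:
            break
        chosen = diversity_sample(candidates, use_counts, rng)
        use_counts[chosen] += 1
        picked.append(chosen)
        dialogue.extend(corpus[chosen])  # concatenate the sampled session
        query = corpus[chosen]           # assumption: query with the latest session
    return dialogue, picked


if __name__ == "__main__":
    corpus = [
        ["Any plans for the weekend?", "I might go hiking."],
        ["Which trail are you hiking?", "The one by the lake."],
        ["Is the lake trail hard?", "Not really, it is mostly flat."],
        ["What did you cook today?", "Just some noodles."],
    ]
    dialogue, picked = build_long_dialogue(0, corpus, Counter(), random.Random(0))
    print(picked)
    print("\n".join(dialogue))
```

Sharing one `use_counts` counter across many seed sessions is what lets the sampler down-weight sessions that are retrieved again and again, which is the intuition behind penalizing generic sessions; the actual scoring used by Re$^3$Dial may differ.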

EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training

EVA2.0: Investigating Open-domain Chinese Dialogue Systems with Large-scale Pre-training

OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

An Empirical Investigation of Pre-Trained Transformer Language Models for Open-Domain Dialogue Generation

A Large-Scale Chinese Short-Text Conversation Dataset

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained Language Model

DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation

PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Re$^3$Dial: Retrieve, Reorganize and Rescale Dialogue Corpus for Long-Turn Open-Domain Dialogue Pre-training

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

XDailyDialog: A Multilingual Parallel Dialogue Corpus

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically Constructed from Live Streaming

Re3Dial: Retrieve, Reorganize and Rescale Conversations for Long-Turn Open-Domain Dialogue Pre-training