Creating a Japanese Dialogue Corpus with Multi-level Topic Analysis

Yuma Komoto, Xin Kang, Fuji Ren
DOI: https://doi.org/10.1109/ICNLP55136.2022.00065
2022-01-01
Abstract: The study of generative dialogue systems has become a hotspot thanks to recent advances in natural language understanding and generation. However, most work has focused on widely used languages, such as English and Chinese, for which huge dialogue corpora have been collected and thoroughly analyzed for training dialogue generation models. In addition, retaining the human-like diversity and spontaneity of speech in these corpora is a challenge when building dialogue systems. In this paper, we propose a method to build a large Japanese dialogue corpus from conversations posted on Twitter and to automatically annotate dialogue- and utterance-level topic labels, together with their probabilistic scores, by analyzing the similarity between the word clusters and the dialogue clusters of the corpus in the same semantic space. The advantage of this method is that it does not require costly human time and effort for simulating dialogues and annotating labels, which is especially useful for less widely used languages, such as Japanese. We compare four filtering settings for the lower bound on utterance length in corpus creation and topic annotation, and report the effect of utterance length on the quality of our dialogue corpus, based on a manual evaluation. Based on this corpus, we further propose two topic-based dialogue generation tasks: next-response-topic prediction and next-topic-based response generation. The Japanese dialogue corpus is available on GitHub.
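The annotation idea in the abstract (scoring utterances against topic word clusters in a shared semantic space, after filtering by a minimum utterance length) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the abstract does not name the embedding model, the clustering algorithm, or how similarities become probabilistic scores. The sketch uses toy 2-dimensional vectors, cosine similarity, and a softmax, and the names `topic_centroids`, `topic_scores`, and `passes_length_filter` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_scores(vec, topic_centroids):
    """Turn similarities to each topic centroid into scores that sum to 1.

    Assumption: a softmax over cosine similarities stands in for the
    paper's probabilistic scores, whose exact form the abstract omits.
    """
    names = list(topic_centroids)
    sims = [cosine(vec, topic_centroids[name]) for name in names]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]  # shift for numerical stability
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


def passes_length_filter(utterances, min_len):
    """Keep a dialogue only if every utterance has at least min_len characters."""
    return all(len(u) >= min_len for u in utterances)


# Toy data: hypothetical word-cluster centroids and one utterance embedding.
topic_centroids = {"food": [1.0, 0.0], "sports": [0.0, 1.0]}
utterance_vec = [0.9, 0.1]

scores = topic_scores(utterance_vec, topic_centroids)
best_topic = max(scores, key=scores.get)
```

In a real pipeline the vectors would come from an embedding model shared by words and utterances, and `min_len` would be one of the four lower-bound settings the paper compares; the specific values are not given in the abstract.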