Automatically Annotate TV Series Subtitles for Dialogue Corpus Construction

Leilan Zhang, Qiang Zhou
DOI: https://doi.org/10.1109/apsipaasc47483.2019.9023129
2019-01-01
Abstract: In recent years, the scarcity of dialogue corpora has become a bottleneck for Chinese dialogue generation systems. Although subtitles are favorable material for constructing dialogue corpora because of their abundance and diversity, the lack of speaker information makes it hard to extract dialogues from them directly. To make use of these resources, we propose an improved method that automatically annotates bilingual TV subtitles with speaker and scene tags using their corresponding scripts. First, the speaker tags and scene boundaries in the scripts are mapped to the subtitles through an information retrieval method. Then, mapping errors are detected with a convolutional network and corrected by heuristic strategies to improve annotation quality. We applied this method to 779 bilingual subtitle files from 4 TV series and obtained a Chinese dialogue corpus, Tv4Dialog, containing 260,674 utterances (publicly available at https://github.com/zll17/TV4Dialog). Experimental results show that our method achieves an accuracy of 94.62% on speaker tag annotation, an improvement of nearly 12% over the previous state-of-the-art result.
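The first step, mapping speaker tags from a script onto subtitle lines by retrieval, can be illustrated with a minimal sketch. The abstract does not specify which retrieval model is used, so the TF-IDF cosine similarity below is an assumption, as are the function names, the `threshold` parameter, and the toy data; the convolutional error detection and heuristic correction steps are not covered.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'"

def tokenize(text):
    # Lowercased whitespace tokens with surrounding punctuation stripped.
    return [w.strip(PUNCT).lower() for w in text.split() if w.strip(PUNCT)]

def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict) per document, with smoothed IDF.
    tokenized = [tokenize(d) for d in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def annotate(subtitles, script, threshold=0.5):
    """Assign each subtitle the speaker of its most similar script line.

    script: list of (speaker, line) pairs (hypothetical input format).
    Returns a list of (speaker or None, subtitle) pairs; None marks
    subtitles whose best match falls below the threshold.
    """
    vecs = tfidf_vectors([line for _, line in script] + list(subtitles))
    script_vecs, sub_vecs = vecs[:len(script)], vecs[len(script):]
    annotated = []
    for sub, sv in zip(subtitles, sub_vecs):
        best_score, best_speaker = 0.0, None
        for (speaker, _), cv in zip(script, script_vecs):
            score = cosine(sv, cv)
            if score > best_score:
                best_score, best_speaker = score, speaker
        annotated.append((best_speaker if best_score >= threshold else None, sub))
    return annotated
```

Subtitles left untagged (`None`) in a sketch like this correspond roughly to the cases a later error detection and correction stage would need to handle.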