Speaker Diarization of Scripted Audiovisual Content

Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico
2023-08-04
Abstract: The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must be structured into a sequence of dialogue lines each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach to leverage production scripts used during the shooting process, to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66-show test set.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of accurately producing verbatim scripts (i.e., as-broadcast scripts) for films and TV shows during media localization. Such a script is a sequence of dialogue lines, each carrying time codes, a speaker name, and a transcript. Speech recognition already helps with the transcript, but assigning each line to the right speaker remains difficult for two reasons:

1. **Multi-speaker tracking**: Existing speaker diarization models struggle to track the large number of speakers that appear in TV shows.
2. **Frequent speaker change detection**: These models have low accuracy in detecting frequent speaker changes, which are common in TV shows.

To tackle these issues, the authors propose a method that uses the production scripts written for the shooting process to extract pseudo-labeled data for the speaker diarization task. The aim is to improve diarization accuracy in exactly the scenarios where current models fall short: many speakers and frequent speaker changes.

The main contributions of the paper include:

- A new semi-supervised method that leverages information from production scripts as pseudo-labels to improve speaker diarization.
- Experiments on a test set of 66 shows, where the method improves on two unsupervised baseline models by 51.7% relative on the authors' metrics.
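To make the two difficulties concrete, the following minimal sketch models an as-broadcast script as a list of dialogue lines (time codes, speaker name, transcript) and counts the distinct speakers and the speaker changes in it. The `DialogueLine` class, the helper functions, and the sample dialogue are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    """One line of an as-broadcast script (hypothetical structure)."""
    start: float   # start time code, in seconds
    end: float     # end time code, in seconds
    speaker: str   # speaker name
    text: str      # transcript of the line


def count_speakers(lines):
    """Number of distinct speakers a diarization model would have to track."""
    return len({line.speaker for line in lines})


def count_speaker_changes(lines):
    """Number of times the speaker differs between consecutive lines."""
    ordered = sorted(lines, key=lambda line: line.start)
    return sum(
        1
        for prev, cur in zip(ordered, ordered[1:])
        if prev.speaker != cur.speaker
    )


# Invented sample dialogue, for illustration only.
script = [
    DialogueLine(0.0, 1.8, "ANNA", "Did you lock the door?"),
    DialogueLine(1.9, 2.6, "BEN", "I thought you did."),
    DialogueLine(2.7, 4.0, "ANNA", "Ben!"),
    DialogueLine(4.1, 5.5, "ANNA", "Go back and check."),
    DialogueLine(5.6, 6.2, "CARA", "I'll do it."),
]

print(count_speakers(script))         # 3
print(count_speaker_changes(script))  # 3
```

Even in this six-second toy exchange there are three speakers and three speaker changes, which hints at why short turns and large casts are hard for diarization models.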