Debatts: Zero-Shot Debating Text-to-Speech Synthesis

Yiqiao Huang, Yuancheng Wang, Jiaqi Li, Haotian Guo, Haorui He, Shunsi Zhang, Zhizheng Wu
2024-11-11
Abstract: In debating, rebuttal is one of the most critical stages, where a speaker addresses the arguments presented by the opposing side. During this process, the speaker synthesizes their own persuasive articulation given the context from the opposing side. This work proposes a novel zero-shot text-to-speech synthesis system for rebuttal, namely Debatts. Debatts takes two speech prompts, one from the opposing side (i.e., the opponent) and one from the speaker. The prompt from the opponent is supposed to provide debating-style prosody, and the prompt from the speaker provides identity information. In particular, we pretrain the Debatts system on an in-the-wild dataset and integrate an additional reference encoder to take the debating prompt for style. In addition, we also create a debating dataset to develop Debatts. In this setting, Debatts can generate debating-style speech in rebuttal for any voice. Experimental results confirm the effectiveness of the proposed system in comparison with classic zero-shot TTS systems.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the problem of generating natural and persuasive rebuttal speech in debate scenarios. Specifically, it proposes a new zero-shot text-to-speech synthesis system (Debatts) for generating debate-style speech in the rebuttal phase. The core problems and solutions in the paper are as follows:

### Core Problems

1. **Rebuttal Speech Generation in Debates**:
   - Rebuttal is a crucial part of a debate. Debaters must respond to the opponent's arguments and express their own views persuasively.
   - Existing TTS (text-to-speech) systems usually fail to capture the intonation and emotional changes found in debates, so the generated speech lacks naturalness and persuasiveness.
2. **Combining Opponent Speech Styles**:
   - Rebuttal depends not only on the debater's own manner of expression but also on the opponent's speech style. Incorporating the opponent's speech style into rebuttal speech generation is a challenge.
3. **Zero-Shot Generation**:
   - Traditional TTS systems usually require a large amount of training data to generate the voice of a specific speaker. In the debate scenario, it may be necessary to generate rebuttal speech for speakers who have never participated in debates, which requires zero-shot generation ability.

### Solutions

1. **Debatts System**:
   - The Debatts system generates debate-style rebuttal speech by combining the opponent's speech as a style prompt with the debater's speech as identity information.
   - The system adopts a two-stage model: the first stage predicts semantic tokens, and the second stage generates acoustic tokens and converts them into waveforms.
2. **Multi-Speaker Debate Dataset (Debatts-Data)**:
   - The paper creates a Chinese debate dataset with rich prosody and meta-information to support the development and training of the system.
   - The dataset is sourced from formal Chinese debate recordings, covering more than 800 matches and more than 1,200 hours of audio in total.
3. **Experimental Verification**:
   - Objective and subjective evaluations verify the effectiveness of the Debatts system in generating natural, debate-style rebuttal speech.
   - The experimental results show that the Debatts system performs strongly in style consistency and similarity, and in particular can generate natural debate-style speech for speakers whose prompts are not debate speech.

### Formula Representation

A key formula in the paper is the following:

- **Probability Distribution of Semantic Token Prediction**:

  \[ p(\hat{S}_{\text{spk}} | T, S_{\text{op}}; \theta_{\text{new}}^{\text{t2s}}) = \prod_{t = 1}^{N} p(\hat{S}_{\text{spk}, t} | S_{\text{spk}, <t}, T, S_{\text{op}}; \theta_{\text{new}}^{\text{t2s}}) \]

  where:
  - \( \hat{S}_{\text{spk}} \) represents the predicted semantic token sequence of the target speech.
  - \( S_{\text{spk}} \) represents the semantic token sequence of the target reference speech.
  - \( S_{\text{op}} \) represents the semantic token sequence of the opponent's speech.
  - \( T \) represents the token sequence of the target text.
  - \( \theta_{\text{new}}^{\text{t2s}} \) represents the model parameters.
  - \( N \) is the length of the predicted sequence.

With these methods and datasets, the Debatts system can generate more natural and persuasive rebuttal speech in debate scenarios.
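The autoregressive factorization in the formula above can be illustrated with a minimal Python sketch. It is not the Debatts model: `toy_next_token_probs` is a hypothetical stand-in that returns a deterministic distribution over a four-token vocabulary, and all names and values are assumptions made for illustration. The sketch only shows how the probability of a whole semantic token sequence is a product of per-step conditionals, each conditioned on the previous tokens, the text, and the opponent's tokens.

```python
import math


def toy_next_token_probs(prefix, text, opponent, vocab_size=4):
    """Hypothetical stand-in for the text-to-semantic model (NOT Debatts).

    Returns p(next token | prefix, text, opponent) over a tiny vocabulary.
    """
    # Deterministic pseudo-logits derived from the conditioning context.
    seed = sum(prefix) + 2 * sum(text) + 3 * sum(opponent) + len(prefix)
    logits = [((seed + 1) * (k + 1)) % 7 / 2.0 for k in range(vocab_size)]
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sequence_log_prob(tokens, text, opponent, step_fn=toy_next_token_probs):
    """Chain rule: log p(S | T, S_op) = sum_t log p(S_t | S_<t, T, S_op)."""
    total = 0.0
    for t, tok in enumerate(tokens):
        probs = step_fn(tokens[:t], text, opponent)
        total += math.log(probs[tok])
    return total


if __name__ == "__main__":
    text_tokens = [1, 2, 3]       # assumed toy text token IDs
    opponent_tokens = [0, 3, 1]   # assumed toy opponent semantic token IDs
    candidate = [2, 0, 1]         # assumed toy target semantic token IDs
    print(sequence_log_prob(candidate, text_tokens, opponent_tokens))
```

Because every step is a proper distribution, the probabilities of all sequences of a fixed length sum to one, which is a quick sanity check on any implementation of this factorization.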