Abstract: In debating, rebuttal is one of the most critical stages, where a speaker addresses the arguments presented by the opposing side. During this process, the speaker synthesizes their own persuasive articulation given the context from the opposing side. This work proposes a novel zero-shot text-to-speech synthesis system for rebuttal, namely Debatts. Debatts takes two speech prompts, one from the opposing side (i.e., the opponent) and one from the speaker. The prompt from the opponent is supposed to provide debating-style prosody, and the prompt from the speaker provides identity information. In particular, we pretrain the Debatts system on an in-the-wild dataset and integrate an additional reference encoder to take the debating prompt for style. In addition, we also create a debating dataset to develop Debatts. In this setting, Debatts can generate debating-style speech in rebuttal for any voice. Experimental results confirm the effectiveness of the proposed system in comparison with classic zero-shot TTS systems.

What problem does this paper attempt to address?

This paper addresses the problem of generating natural and persuasive rebuttal speech in debate scenarios. Specifically, it proposes a new zero-shot text-to-speech synthesis system (Debatts) for generating debate-style speech in the rebuttal phase. The core problems and solutions in the paper are as follows:

### Core Problems

1. **Rebuttal Speech Generation in Debates**:
   - Rebuttal is a crucial part of a debate. Debaters need to respond to the opponent's arguments and express their own views persuasively.
   - Existing TTS (text-to-speech) systems usually fail to capture the intonation and emotional changes of debate speech, so the generated speech lacks naturalness and persuasiveness.
2. **Combining Opponent Speech Styles**:
   - Rebuttal depends not only on the debater's own manner of expression but also on the opponent's speech style. Incorporating the opponent's speech style into rebuttal speech generation is a challenge.
3. **Zero-Shot Generation**:
   - Traditional TTS systems usually require a large amount of training data to generate the voice of a specific speaker. In the debate scenario, it may be necessary to generate rebuttal speech for speakers who have never participated in debates, which requires zero-shot generation ability.

### Solutions

1. **Debatts System**:
   - The Debatts system generates debate-style rebuttal speech by using the opponent's speech as a style prompt and the debater's speech as identity information.
   - The system adopts a two-stage model: the first stage predicts semantic tokens, and the second stage generates acoustic tokens and converts them into waveforms.
2. **Multi-Speaker Debate Dataset (Debatts-Data)**:
   - The paper creates a Chinese debate dataset with rich intonation and meta-information to support the development and training of the system.
   - The dataset is sourced from formal Chinese debate recordings, covering more than 800 matches and a total of more than 1,200 hours of audio.
3. **Experimental Verification**:
   - Objective and subjective evaluations verify the effectiveness of the Debatts system in generating natural, debate-style rebuttal speech.
   - The experimental results show that the Debatts system performs well in terms of style consistency and similarity, and in particular can generate natural debate-style speech from the voices of non-debate speakers.

### Formula Representation

Some key formulas involved in the paper are as follows:

- **Probability Distribution of Semantic Token Prediction**:

\[
p(\hat{S}_{\text{spk}} | T, S_{\text{op}}; \theta_{\text{new}}^{\text{t2s}}) = \prod_{t = 1}^{N} p(\hat{S}_{\text{spk}, t} | S_{\text{spk}, <t}, T, S_{\text{op}}; \theta_{\text{new}}^{\text{t2s}})
\]

where:

- \( \hat{S}_{\text{spk}} \) represents the predicted semantic token sequence of the target speech.
- \( S_{\text{spk}} \) represents the semantic token sequence of the target reference speech.
- \( S_{\text{op}} \) represents the semantic token sequence of the opponent's speech.
- \( T \) represents the semantic token sequence of the target text.
- \( \theta_{\text{new}}^{\text{t2s}} \) represents the model parameters.

With the support of these methods and datasets, the Debatts system can generate more natural and persuasive rebuttal speech in debate scenarios.
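
The factorization above is the standard autoregressive chain rule: the probability of the whole semantic sequence is the product of per-step conditionals, each of which sees the earlier tokens, the text \( T \), and the opponent prompt \( S_{\text{op}} \). The following is a minimal Python sketch of that idea only. The function `toy_next_token_probs`, the three-token vocabulary, and the example inputs are hypothetical stand-ins invented for illustration; they are not the Debatts model or its interface, and \( N \) is assumed to be the length of the target sequence.

```python
import itertools
import math

VOCAB_SIZE = 3  # hypothetical tiny semantic-token vocabulary


def toy_next_token_probs(prefix, text_tokens, opponent_tokens):
    """Stand-in for p(S_t | S_<t, T, S_op); NOT the Debatts model.

    Returns a normalized distribution over VOCAB_SIZE tokens that depends on
    the tokens emitted so far, the text, and the opponent prompt.
    """
    context = sum(prefix) + sum(text_tokens) + sum(opponent_tokens)
    scores = [1.0 + ((context + k) % VOCAB_SIZE) for k in range(VOCAB_SIZE)]
    total = sum(scores)
    return [s / total for s in scores]


def sequence_log_prob(target, text_tokens, opponent_tokens):
    """Chain rule: log p(S | T, S_op) = sum over t of log p(S_t | S_<t, T, S_op)."""
    log_p = 0.0
    for t, token in enumerate(target):
        probs = toy_next_token_probs(target[:t], text_tokens, opponent_tokens)
        log_p += math.log(probs[token])
    return log_p


def greedy_decode(length, text_tokens, opponent_tokens):
    """Emit tokens one at a time, each conditioned on those already emitted."""
    out = []
    for _ in range(length):
        probs = toy_next_token_probs(out, text_tokens, opponent_tokens)
        out.append(max(range(VOCAB_SIZE), key=probs.__getitem__))
    return out


if __name__ == "__main__":
    text = [1, 0, 2]       # hypothetical text tokens T
    opponent = [2, 2, 1]   # hypothetical opponent semantic tokens S_op

    # Sanity check: probabilities of all length-3 sequences sum to one.
    total = sum(
        math.exp(sequence_log_prob(seq, text, opponent))
        for seq in itertools.product(range(VOCAB_SIZE), repeat=3)
    )
    print("total probability mass:", total)
    print("greedy sequence:", greedy_decode(4, text, opponent))
```

Summing log-probabilities instead of multiplying raw probabilities is a common choice because it avoids numerical underflow on long sequences. In a real system the toy conditional would be replaced by a trained network, and decoding would typically use sampling rather than the greedy rule shown here.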

Debatts: Zero-Shot Debating Text-to-Speech Synthesis
