Fine-grained style control in Transformer-based Text-to-speech Synthesis

Li-Wei Chen, Alexander Rudnicky
DOI: https://doi.org/10.48550/arXiv.2110.06306
2022-03-17
Abstract: In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
Audio and Speech Processing, Computation and Language, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses how to achieve fine-grained style control in a Transformer-based text-to-speech synthesis system (TransformerTTS). The authors introduce a new architecture so that the generated speech is natural and intelligible, conveys the speaking style of the reference speech more faithfully, and avoids the content-leakage problem.

### Main Problem Analysis

1. **Content-leakage Problem**:
   - In existing TTS systems, the model may extract linguistic content from the reference speech along with its style, so the generated speech no longer matches the input text. This phenomenon is called "content leakage". To overcome it, the paper proposes a method that models and controls the speaking style without encoding the linguistic content.
2. **Fine-grained Style Control**:
   - Existing style control methods can usually handle only global styles or fixed-length style embeddings, and cannot adequately capture or control characteristics that vary over time within a sentence (such as speech rate, volume, and prosody). The paper proposes a method based on Local Style Tokens (LST) to achieve finer-grained style control.
3. **Style Transfer in Multi-speaker Settings**:
   - Beyond the single-speaker scenario, the paper also explores style transfer in multi-speaker settings, so that styles can be transferred between different speakers.

### Solutions

- **Local Style Tokens (LST)**:
  - Extract a time sequence of local style tokens (LST) from the reference speech and fuse this style information with the text content through a cross-attention mechanism. Because the LST form a sequence rather than a single vector, the model can capture subtler, time-varying style changes.
- **Cross-attention Module**:
  - Replace the original content encoder with cross-attention blocks that gradually incorporate the given style into the phoneme representation. Because the fusion is performed along the skip connection, the block provides a good inductive bias for injecting style information step by step.
- **Training Tricks**:
  - During training, randomly truncate the LST so that the model does not rely too heavily on the complete reference speech, which alleviates the content-leakage problem. In addition, wav2vec 2.0 features are used as the audio representation, further reducing the risk of content leakage.

With these changes, the proposed system is reported to perform better in terms of naturalness, intelligibility, and style transferability, while mitigating the content-leakage problem.
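The two mechanisms summarized above, random truncation of the LST during training and cross-attention fusion along a skip connection, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `random_truncate` and `cross_attention_fuse` are hypothetical, the truncation is assumed to keep a random-length prefix, and the attention is a single head with no learned projections, layer normalization, or feed-forward layer.

```python
import math
import random


def random_truncate(lst_tokens, min_len=1, rng=random):
    """Keep a random-length prefix of the local style token sequence.

    Assumption: truncation keeps a prefix; the paper may cut differently.
    """
    n = len(lst_tokens)
    if n == 0:
        return []
    # Clamp min_len so randint never receives an empty range.
    keep = rng.randint(min(min_len, n), n)
    return lst_tokens[:keep]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(phonemes, style_tokens):
    """Fuse style into phoneme vectors with scaled dot-product attention.

    phonemes:     list of d-dimensional vectors, used as queries.
    style_tokens: list of d-dimensional vectors, used as keys and values.
    Returns phoneme + attended style (the skip connection), one vector
    per phoneme, so the output keeps the length of the text sequence.
    """
    d = len(phonemes[0])
    scale = math.sqrt(d)
    fused = []
    for q in phonemes:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in style_tokens]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, style_tokens))
                   for j in range(d)]
        # Skip connection: style is added on top of the phoneme vector.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Toy usage: 3 phoneme vectors, 5 style tokens, dimension 2.
phoneme_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
style_sequence = [[0.1 * i, 0.2] for i in range(5)]
training_style = random_truncate(style_sequence)
fused_vectors = cross_attention_fuse(phoneme_vectors, training_style)
print(len(training_style), len(fused_vectors))
```

Because the style context is added to the phoneme vector rather than replacing it, stacking several such blocks lets the style be infused gradually, and the output length always follows the text rather than the reference audio.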