ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Hancheng Zhuang, Yinlin Guo, Yuehai Wang
DOI: https://doi.org/10.1109/auteee60196.2023.10408683
2023-01-01
Abstract: In multilingual Text-to-Speech (TTS), current methods, particularly those that rely on Grapheme-to-Phoneme (G2P) conversion or character-level vocabularies, often suffer from increased computational demands or higher error rates. To address these issues, we propose ViSPer, a VITS-based model that uses Byte-Pair Encoding (BPE) to build subword-level vocabularies, simplifying text processing while improving pronunciation accuracy. To improve the expressiveness of the synthesized speech, we further introduce a deep feature loss computed on features extracted by Whisper, a multilingual ASR model. Experimental results show that our approach outperforms the baseline system in both voice quality and naturalness of the synthesized speech, on subjective as well as objective evaluation metrics. Ablation studies demonstrate that each design choice is effective.
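
To make the idea of a deep feature loss concrete, the following is a minimal sketch and not the authors' implementation. It assumes the loss is a sum over layers of the mean absolute difference between the activations of real and synthesized audio; the abstract does not state the distance or the layers used. The two toy layers (`moving_average`, `rectify`) and the names `extract_features`, `deep_feature_loss`, and `FROZEN_LAYERS` are hypothetical stand-ins for a frozen pretrained encoder such as Whisper's.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Sequence[float]], Vector]


def moving_average(x: Sequence[float]) -> Vector:
    """Toy stand-in layer: mean of each adjacent pair of values."""
    return [(a + b) / 2.0 for a, b in zip(x, x[1:])]


def rectify(x: Sequence[float]) -> Vector:
    """Toy stand-in layer: ReLU-style nonlinearity."""
    return [max(0.0, v) for v in x]


# Hypothetical frozen feature extractor. In the paper this role is played
# by Whisper; these toy layers only keep the sketch self-contained.
FROZEN_LAYERS: List[Layer] = [moving_average, rectify]


def extract_features(x: Sequence[float], layers: Sequence[Layer]) -> List[Vector]:
    """Pass x through the frozen layers and keep every intermediate activation."""
    feats: List[Vector] = []
    h: Vector = list(x)
    for layer in layers:
        h = layer(h)
        feats.append(h)
    return feats


def deep_feature_loss(
    real: Sequence[float],
    fake: Sequence[float],
    layers: Sequence[Layer] = FROZEN_LAYERS,
) -> float:
    """Sum over layers of the mean absolute difference between activations.

    The L1 distance and the equal per-layer weighting are assumptions.
    """
    total = 0.0
    for f_real, f_fake in zip(
        extract_features(real, layers), extract_features(fake, layers)
    ):
        total += sum(abs(a - b) for a, b in zip(f_real, f_fake)) / len(f_real)
    return total


if __name__ == "__main__":
    real = [0.0, 1.0, -1.0, 2.0]   # toy "ground-truth" waveform
    fake = [0.0, 1.0, -1.0, 0.0]   # toy "synthesized" waveform
    print(deep_feature_loss(real, real))
    print(deep_feature_loss(real, fake))
```

In a real training setup, a term of this kind would be added to the generator's objective while the feature extractor's weights stay fixed, so that gradients only update the TTS model.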