Enhancing Speech-to-Speech Translation with Multiple TTS Targets

Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe
2023-04-10
Abstract: It has been known that direct speech-to-speech translation (S2ST) models usually suffer from the data scarcity issue because of the limited existing parallel materials for both source and target speech. Therefore to train a direct S2ST system, previous works usually utilize text-to-speech (TTS) systems to generate samples in the target language by augmenting the data from speech-to-text translation (S2TT). However, there is a limited investigation into how the synthesized target speech would affect the S2ST models. In this work, we analyze the effect of changing synthesized target speech for direct S2ST models. We find that simply combining the target speech from different TTS systems can potentially improve the S2ST performances. Following that, we also propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems. Extensive experiments demonstrate that our proposed framework achieves consistent improvements (2.8 BLEU) over the baselines on the Fisher Spanish-English dataset.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the shortage of training data for direct speech-to-speech translation (S2ST) models, which stems from the scarcity of parallel speech in the source and target languages. To work around this shortage, previous studies typically use text-to-speech (TTS) systems to synthesize target-language speech, augmenting data from speech-to-text translation (S2TT). However, few studies have examined how the synthesized target speech affects S2ST models. The authors analyze the effect of changing the synthesized target speech on direct S2ST models and find that simply combining target speech from different TTS systems can potentially improve S2ST performance. Building on this finding, they propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems. Experiments show that the proposed framework achieves consistent improvements (2.8 BLEU) over the baselines on the Fisher Spanish-English dataset. In short, the paper investigates how target speech synthesized by different TTS systems affects S2ST performance and proposes a multi-target training framework to improve it.
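The two ideas summarized above, combining target speech from several TTS systems and jointly optimizing over multiple targets, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the file-path representation of target audio, and the weighted-sum form of the joint objective are all assumptions made here for illustration, and the summary does not specify how the paper combines the per-target losses.

```python
from typing import Dict, List, Tuple


def combine_tts_targets(
    source_ids: List[str],
    targets_by_tts: Dict[str, Dict[str, str]],
) -> List[Tuple[str, str, str]]:
    """Pair each source utterance with the target audio from every TTS system.

    Hypothetical helper: returns (source_id, tts_name, target_path) triples,
    so one source utterance yields one training example per TTS system.
    """
    examples = []
    for src in source_ids:
        for tts_name in sorted(targets_by_tts):
            target = targets_by_tts[tts_name].get(src)
            if target is not None:  # skip utterances a system did not synthesize
                examples.append((src, tts_name, target))
    return examples


def multi_target_loss(
    losses_by_tts: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-target losses (assumed form of the joint objective).

    Any TTS target without an explicit weight gets a weight of 1.0.
    """
    return sum(weights.get(name, 1.0) * loss for name, loss in losses_by_tts.items())


if __name__ == "__main__":
    # Hypothetical utterance IDs and audio paths, for illustration only.
    pairs = combine_tts_targets(
        ["utt1", "utt2"],
        {
            "tts_a": {"utt1": "a/utt1.wav", "utt2": "a/utt2.wav"},
            "tts_b": {"utt1": "b/utt1.wav"},
        },
    )
    print(pairs)
    print(multi_target_loss({"tts_a": 2.0, "tts_b": 4.0}, {"tts_b": 0.5}))
```

In an actual system the per-target losses would come from model outputs for each TTS target rather than fixed numbers, and the weights would be tuned as hyperparameters.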