Enhancing Speech-to-Speech Translation with Multiple TTS Targets

Jiatong Shi, Yun Tang, Ann Lee, Hirofumi Inaguma, Changhan Wang, Juan Pino, Shinji Watanabe

2023-04-10

Abstract: It has been known that direct speech-to-speech translation (S2ST) models usually suffer from data scarcity because existing parallel materials for both source and target speech are limited. Therefore, to train a direct S2ST system, previous works usually utilize text-to-speech (TTS) systems to generate samples in the target language by augmenting the data from speech-to-text translation (S2TT). However, there has been limited investigation into how the synthesized target speech affects S2ST models. In this work, we analyze the effect of changing synthesized target speech for direct S2ST models. We find that simply combining the target speech from different TTS systems can potentially improve S2ST performance. Following that, we also propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems. Extensive experiments demonstrate that our proposed framework achieves consistent improvements (2.8 BLEU) over the baselines on the Fisher Spanish-English dataset.

Sound, Computation and Language, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the training-data shortage faced by direct speech-to-speech translation (S2ST) models, which stems from the scarcity of parallel source and target speech. To work around it, previous studies typically use text-to-speech (TTS) systems to synthesize target-language speech from speech-to-text translation (S2TT) data. How the choice of synthesized target speech affects S2ST models, however, has received little study. The authors analyze the effect of changing the synthesized target speech for direct S2ST models and find that simply combining target speech from different TTS systems can potentially improve S2ST performance. Building on this finding, they propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems. Experiments on the Fisher Spanish-English dataset show consistent improvements over the baselines, a gain of 2.8 BLEU. In short, the paper examines how target speech synthesized by different TTS systems affects direct S2ST models and proposes a multi-target training framework that takes advantage of that effect.
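The idea of jointly optimizing one S2ST system against targets from several TTS systems can be illustrated with a small sketch. This is not the authors' implementation: the summary above gives no architectural details, so the sketch assumes each TTS target is represented as a sequence of discrete token IDs, that the model emits per-step log-probabilities for each target, and that the per-target losses are combined by a weighted sum. The function names (`sequence_nll`, `multi_target_loss`), the target names (`tts_a`, `tts_b`), and the weights are all hypothetical.

```python
import math
from typing import Dict, List

def sequence_nll(log_probs: List[List[float]], target: List[int]) -> float:
    """Mean negative log-likelihood of one target sequence.

    log_probs[t][k] is the model's log-probability of token k at step t.
    """
    if len(log_probs) != len(target) or not target:
        raise ValueError("log_probs and target must be non-empty and equal in length")
    return -sum(step[tok] for step, tok in zip(log_probs, target)) / len(target)

def multi_target_loss(
    log_probs_by_tts: Dict[str, List[List[float]]],
    targets_by_tts: Dict[str, List[int]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-TTS-target losses (assumed multi-task objective)."""
    return sum(
        weights[name] * sequence_nll(log_probs_by_tts[name], target)
        for name, target in targets_by_tts.items()
    )

# Toy example: two hypothetical TTS targets, a 4-token vocabulary, and a
# model that is maximally uncertain (uniform distribution at every step).
uniform_step = [math.log(0.25)] * 4
log_probs_by_tts = {
    "tts_a": [uniform_step] * 3,  # decoder branch for the first TTS target
    "tts_b": [uniform_step] * 2,  # decoder branch for the second TTS target
}
targets_by_tts = {"tts_a": [0, 1, 2], "tts_b": [3, 3]}
weights = {"tts_a": 0.5, "tts_b": 0.5}

loss = multi_target_loss(log_probs_by_tts, targets_by_tts, weights)
print(f"combined loss: {loss:.4f}")
```

With a uniform distribution over four tokens, each per-target loss equals ln 4, so the equally weighted combination is also ln 4. In a real system the log-probabilities would come from the trained model, and the weighting scheme would be a tuning choice.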
