Abstract: Speech-to-speech translation (S2ST) enables spoken communication between people talking in different languages. Despite a few studies on multilingual S2ST, their focus is the multilinguality on the source side, i.e., the translation from multiple source languages to one target language. We present the first work on multilingual S2ST supporting multiple target languages. Leveraging recent advances in direct S2ST with speech-to-unit (S2U) and vocoder, we equip these key components with multilingual capability. Speech-to-masked-unit (S2MU) is the multilingual extension of S2U, which applies masking to units which don't belong to the given target language to reduce the language interference. We also propose a multilingual vocoder which is trained with language embedding and the auxiliary loss of language identification. On benchmark translation testsets, our proposed multilingual model outperforms bilingual models in the translation from English into 16 target languages.
What problem does this paper attempt to address?
This paper addresses the challenge of supporting multiple target languages in multilingual speech-to-speech translation (S2ST). Existing multilingual S2ST research mainly focuses on translation from multiple source languages into a single target language; this paper is the first attempt to build a multilingual S2ST system that translates from one source language (English, in the reported experiments) into multiple target languages. To achieve this, the authors build on the direct S2ST framework and extend its key components, the speech-to-unit (S2U) module and the vocoder, with multilingual capabilities.
### Main contributions:
1. **Speech-to-Masked-Unit (S2MU) model**:
- Units that do not belong to the given target language are masked in the unit sequence.
- Masking lets the model focus on the units of the target language, which reduces interference from other languages and improves translation performance.
2. **Multilingual vocoder**:
- A language embedding and an auxiliary language identification loss are introduced to reduce language interference in multilingual synthesis.
- With these additions, the multilingual vocoder can synthesize high-quality speech for multiple languages from similar language families, reducing the number of vocoders required.
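The masking idea behind S2MU can be illustrated with a minimal NumPy sketch. It assumes a shared unit vocabulary in which each target language owns a known subset of unit indices; the `UNITS_BY_LANG` table, the vocabulary size, and the function name `masked_unit_log_probs` are hypothetical and chosen only for illustration, not taken from the paper's implementation.

```python
import numpy as np

# Hypothetical mapping from target language to the unit indices it uses,
# over an illustrative shared vocabulary of 6 units.
UNITS_BY_LANG = {"es": [0, 1, 2, 3], "fr": [2, 3, 4, 5]}


def masked_unit_log_probs(logits, allowed_units):
    """Log-softmax over units, with units outside `allowed_units` masked out."""
    logits = np.asarray(logits, dtype=np.float64)
    allowed = list(allowed_units)
    if not allowed:
        raise ValueError("at least one unit must be allowed")
    mask = np.zeros(logits.shape[-1], dtype=bool)
    mask[allowed] = True
    # Masked units get a score of -inf, so they receive zero probability.
    masked = np.where(mask, logits, -np.inf)
    # Numerically stable log-softmax over the remaining units.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


# One decoding step whose highest raw score falls on a unit outside "es".
logits = np.array([[2.0, 0.5, 0.1, 0.3, 3.0, 1.0]])
log_probs = masked_unit_log_probs(logits, UNITS_BY_LANG["es"])
probs = np.exp(log_probs)
best_unit = int(probs.argmax(axis=-1)[0])
```

Without the mask, the highest-scoring unit in this toy step would be index 4, which is not in the hypothetical Spanish unit set; with the mask, all probability mass is redistributed over the allowed units.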
### Experimental results:
- The **multilingual S2ST model** outperforms the bilingual models when translating from English into 16 target languages. Average BLEU increases by 5.2 on in-domain data and by 2.7 on out-of-domain data.
- The **multilingual vocoder** also outperforms the monolingual vocoder in synthesis quality, especially in high-resource language directions, where the BLEU improvement is larger.
### Analysis and discussion:
- The multilingual S2MU model outperforms the Textless model in most language directions, but performs poorly on Slavic languages with very limited resources (such as Croatian, Slovak and Slovenian).
- Matching model capacity to the amount of available language resources is crucial for translation performance. For high-resource languages, large-capacity models (such as S2U) perform well, while for low-resource languages, small-capacity models (such as Textless) are better.
- In extremely low-resource language directions (such as Estonian, Finnish and Lithuanian), all models perform unsatisfactorily.
- The data domain also has an important impact on translation performance: models usually perform better on in-domain data than on out-of-domain data.
### Future work:
- Explore multilingual data sampling strategies to address the problem of unbalanced training data.
- Use data-driven methods to optimize language grouping to further promote cross-language transfer and reduce language interference.
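As an illustration of what a data sampling strategy for unbalanced training data can look like, the sketch below implements temperature-based sampling, a technique commonly used in multilingual translation. It is offered only as one possible option, not as something the paper reports using; the function name, the language labels, and the data sizes are hypothetical.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Return per-language sampling probabilities p_l proportional to (n_l / N) ** (1 / T).

    `sizes` maps a language label to its amount of training data.
    T = 1 reproduces proportional sampling; larger T flattens the
    distribution so low-resource languages are sampled more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    weights = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()
    }
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical, made-up data sizes for three target languages.
sizes = {"high": 1000, "mid": 100, "low": 10}
proportional = temperature_sampling_probs(sizes, temperature=1.0)
smoothed = temperature_sampling_probs(sizes, temperature=5.0)
```

With a temperature above 1, the low-resource language is sampled more often than its raw share of the data and the high-resource language less often, while the ordering of the languages is preserved.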
In conclusion, this paper advances multilingual S2ST: by adding multilingual capabilities to the speech-to-unit model and the vocoder, it significantly improves translation from one source language into multiple target languages compared with bilingual models.