Abstract: Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. Additionally, we investigate different feature fusion strategies to further improve the integration of speaker and content features. Experiments conducted on the CVSS-T dataset for ES-EN and FR-EN tasks demonstrate that our proposed method achieves a BLEU score improvement of 1.14 compared to SC-S2UT, along with significant enhancements in MOS and speaker similarity. Furthermore, our approach achieves translation quality comparable to traditional S2UT, with only a minimal increase of 0.04s per utterance in inference time, while maintaining high speaker similarity. These results validate the effectiveness of the proposed method.

What problem does this paper attempt to address?

The paper addresses how to preserve the speaker's voice characteristics in direct speech-to-speech translation (S2ST). Traditional S2ST methods usually lose the unique voice characteristics of the source-language speaker, so the generated target-language speech sounds like a generic voice rather than the original speaker's voice.

### Specific Background of the Problem

1. **Limitations of Traditional Cascade Systems**:
   - Traditional S2ST systems adopt a cascade approach: automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) are carried out in sequence. This structure is prone to error propagation and has slow inference speed.
2. **Problems of End-to-End Systems**:
   - End-to-end (E2E) systems such as the Translatotron series have improved translation speed and accuracy, but still do not preserve the speaker's voice characteristics well when generating target-language speech.
3. **Challenges of Discrete-Unit Approaches**:
   - The speech-to-discrete-unit translation (S2UT) framework generates target-language speech by mapping the source speech to discrete units. However, discrete units mainly capture content information and ignore the speaker's voice characteristics, so the generated speech lacks individuality.

### Solutions Proposed in the Paper

To overcome the above problems, the paper proposes the following improvements:

1. **Self-Supervised Pretraining**:
   - A self-supervised pretraining method is proposed to train the speaker adapter and the unit-to-mel structure separately, in order to enhance their ability to extract speaker information.
2. **Utilization of Speaker Embedding**:
   - A pretrained cross-language speaker adapter (such as the GE2E model) is used to directly extract the speaker embedding vector, thereby preserving the speaker's voice characteristics during training and inference.
3. **Feature Fusion Strategies**:
   - Three feature fusion methods (cross-attention, gated linear unit, additive feed-forward network) are explored to combine content features and speaker features effectively and generate a higher-quality mel-spectrogram.

### Experimental Results

The experimental results show that the proposed method achieves significant improvements in several respects:

- **BLEU Score**: On the ES-EN and FR-EN tasks, the BLEU scores of the pretrained SC-S2UT model are 17.24 and 22.82 respectively, approaching or even exceeding existing cascade systems.
- **MOS Score**: In the naturalness evaluation, the MOS score of the pretrained SC-S2UT model reaches 3.35 ± 0.06 (ES-EN), which is better than that of the previous SC-S2UT model (3.26 ± 0.13).
- **Speaker Similarity**: Cosine similarity between speaker embeddings is computed to verify that the new method better preserves the speaker's voice characteristics.
- **Inference Time**: Although an additional speaker information processing module is introduced, the inference time increases by only 0.04 seconds, maintaining high efficiency.

In conclusion, by introducing techniques such as self-supervised pretraining and speaker embedding, the paper significantly improves the translation quality and speaker voice preservation of the S2ST system without significantly increasing the inference time.
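Of the three fusion strategies named above, the gated linear unit is the simplest to illustrate. The following is a minimal NumPy sketch, not the paper's implementation: it assumes the speaker embedding is tiled across time, concatenated with the frame-level content features, projected linearly, and split into a value half and a gate half. The function name `glu_fuse`, the dimensions, and the random weights are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def glu_fuse(content, speaker, weight, bias):
    """Fuse content frames with one speaker embedding via a gated linear unit.

    content: (T, d_content) frame-level content features
    speaker: (d_speaker,) utterance-level speaker embedding
    weight:  (d_content + d_speaker, 2 * d_out) projection matrix (hypothetical)
    bias:    (2 * d_out,) projection bias (hypothetical)
    returns: (T, d_out) fused features
    """
    num_frames = content.shape[0]
    # Repeat the single speaker vector for every time step.
    speaker_tiled = np.broadcast_to(speaker, (num_frames, speaker.shape[0]))
    joint = np.concatenate([content, speaker_tiled], axis=-1)
    projected = joint @ weight + bias
    # First half carries values, second half carries gate logits.
    value, gate = np.split(projected, 2, axis=-1)
    return value * sigmoid(gate)


# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
content = rng.standard_normal((5, 8))      # 5 frames, 8-dim content features
speaker = rng.standard_normal(4)           # 4-dim speaker embedding
weight = rng.standard_normal((12, 16)) * 0.1
bias = np.zeros(16)

fused = glu_fuse(content, speaker, weight, bias)
print(fused.shape)  # (5, 8)
```

In this sketch the gate lets the model decide, per frame and per channel, how much of the projected content-plus-speaker signal to pass on; a cross-attention or additive variant would replace only the body of `glu_fuse`.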
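The speaker-similarity metric mentioned above is the cosine similarity between two speaker embeddings. A minimal sketch in plain Python follows; the embedding values are made up for illustration, and in practice the vectors would come from a speaker encoder such as the GE2E model named in the summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: source speech vs. translated speech.
source_embedding = [0.2, 0.1, 0.7]
generated_embedding = [0.4, 0.2, 1.4]   # same direction, different scale

similarity = cosine_similarity(source_embedding, generated_embedding)
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(similarity, 6), orthogonal)
```

Because the measure depends only on direction, scaling an embedding does not change the score; values near 1 indicate that the generated speech stays close to the source speaker in the embedding space.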

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

SimulS2S: End-to-End Simultaneous Speech to Speech Translation

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Multilingual Speech-to-Speech Translation into Multiple Target Languages

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Pre-Trained Acoustic-and-Textual Modeling for End-To-End Speech-To-Text Translation

Speaker voice normalization for end-to-end speech translation

Improving Speech-to-Speech Translation Through Unlabeled Text

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Textless Speech-to-Speech Translation With Limited Parallel Data

Representation Purification for End-to-End Speech Translation

Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

Learning Semantic Information from Machine Translation to Improve Speech-to-Text Translation

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

Learning Shared Semantic Space for Speech-to-Text Translation

Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation