Abstract: Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. Additionally, we investigate different feature fusion strategies to further improve the integration of speaker and content features. Experiments conducted on the CVSS-T dataset for ES-EN and FR-EN tasks demonstrate that our proposed method achieves a BLEU score improvement of 1.14 compared to SC-S2UT, along with significant enhancements in MOS and speaker similarity. Furthermore, our approach achieves translation quality comparable to traditional S2UT, with only a minimal increase of 0.04s per utterance in inference time, while maintaining high speaker similarity. These results validate the effectiveness of the proposed method.

Pretreatment for Speech Machine Translation

Preprocessing Improvement in Mime Speech Recognition based on Surface Electromyogram

The Impact of ASR on Speech-to-Speech Translation Performance

Text-conditioned Transformer for Automatic Pronunciation Error Detection

Improving the Robustness of Speech Translation

A Study of Pre-editing Methods at the Lexical Level in the Process of Machine Translation

Automatic Speech Recognition Post-Processing for Readability: Task, Dataset and a Two-Stage Pre-Trained Approach

The MSRA Machine Translation System for IWSLT 2010

Streaming Punctuation for Long-form Dictation with Transformers

Segmenting Subtitles for Correcting ASR Segmentation Errors

Fine Grained Human Evaluation for English-to-Chinese Machine Translation: A Case Study on Scientific Text

Pre-Translation for Neural Machine Translation

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation

Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines

An Unified and Automatic Approach of Mandarin HTS System

Improvement of Korean-Chinese Machine Translation Based on Complex Sentence Deconstruction

Bring More Attention to Syntactic Symmetry for Automatic Postediting of High-Quality Machine Translations

Representation Purification for End-to-End Speech Translation

Improved Long-Form Spoken Language Translation with Large Language Models

Understanding and Bridging the Modality Gap for Speech Translation