Abstract: Direct speech-to-speech translation (S2ST) with discrete units leverages recent progress in speech representation learning. Specifically, a sequence of discrete representations derived in a self-supervised manner is predicted by the model and passed to a vocoder for speech reconstruction. This approach still faces the following challenges: 1) acoustic multimodality: the discrete units derived from speech with the same content can be nondeterministic due to acoustic properties (e.g., rhythm, pitch, and energy), which degrades translation accuracy; 2) high latency: current S2ST systems use autoregressive models that predict each unit conditioned on the previously generated sequence, failing to take full advantage of parallelism. In this work, we propose TranSpeech, a speech-to-speech translation model with bilateral perturbation. To alleviate the acoustic multimodality problem, we propose bilateral perturbation (BiP), which consists of style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations. With reduced multimodality, we take a step forward and become the first to establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices and produces high-accuracy results in just a few cycles. Experimental results on three language pairs demonstrate that BiP yields an average improvement of 2.9 BLEU over a baseline textless S2ST model. Moreover, our parallel decoding significantly reduces inference latency, enabling a speedup of up to 21.4x over the autoregressive technique. Audio samples are available at https://TranSpeech.github.io/
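The "repeatedly masks and predicts" decoding mentioned in the abstract can be illustrated with a minimal sketch. This is not the TranSpeech implementation: the linear re-masking schedule is borrowed from generic mask-predict decoding as an assumption, and `MASK`, `TARGET`, `toy_predict`, and `mask_predict` are hypothetical names, with a hand-written toy predictor standing in for the neural unit decoder.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id for the mask symbol (assumption)

# A predictor maps a partially masked unit sequence to one
# (unit, confidence) pair per position, all computed in parallel.
Predictor = Callable[[List[int]], List[Tuple[int, float]]]


def mask_predict(predict: Predictor, length: int, iterations: int) -> List[int]:
    """Iteratively fill masked positions, then re-mask the least confident ones."""
    units = [MASK] * length
    conf = [0.0] * length
    for t in range(iterations):
        preds = predict(units)
        # Only masked positions are overwritten; kept units retain their scores.
        for i, (unit, prob) in enumerate(preds):
            if units[i] == MASK:
                units[i], conf[i] = unit, prob
        # Assumed schedule: the number of re-masked positions decays linearly to zero.
        n_mask = length * (iterations - 1 - t) // iterations
        if n_mask == 0:
            break
        worst = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        for i in worst:
            units[i] = MASK
    return units


# Toy stand-in for a trained model: it is only confident about a position
# when a neighbouring unit is visible, or when the position index is even.
TARGET = [5, 9, 2, 7, 7, 3]


def toy_predict(units: List[int]) -> List[Tuple[int, float]]:
    out = []
    for i in range(len(units)):
        left = units[i - 1] if i > 0 else MASK
        right = units[i + 1] if i + 1 < len(units) else MASK
        if left != MASK or right != MASK or i % 2 == 0:
            out.append((TARGET[i], 0.9))
        else:
            out.append((0, 0.1))
    return out


one_pass = mask_predict(toy_predict, len(TARGET), iterations=1)
three_pass = mask_predict(toy_predict, len(TARGET), iterations=3)
```

In this toy setting a single parallel pass leaves the odd positions wrong, while three cycles recover the full target, which mirrors the claim that a few refinement cycles suffice. Each cycle predicts all positions at once, so the number of model calls is fixed by `iterations` rather than by the sequence length.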

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

Divergence-Guided Simultaneous Speech Translation

M3ST: Mix at Three Levels for Speech Translation

Understanding and Bridging the Modality Gap for Speech Translation

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Improving speech translation by fusing speech and text

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

Bridging the Modality Gap for Speech-to-Text Translation

CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Learning Shared Semantic Space for Speech-to-Text Translation

DUB: Discrete Unit Back-translation for Speech Translation

CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Training Simultaneous Speech Translation with Robust and Random Wait-k-Tokens Strategy

Representation Purification for End-to-End Speech Translation

Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation

Speech Sense Disambiguation: Tackling Homophone Ambiguity in End-to-End Speech Translation

Tuning Large language model for End-to-end Speech Translation

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Back Translation for Speech-to-text Translation Without Transcripts